Get started: Enhance and cleanse data
This get started article walks you through using a Databricks notebook to cleanse and enhance the New York State baby name data that was previously loaded into a table in Unity Catalog by using Python, Scala, and R. In this article, you change column names, change capitalization, and spell out the sex of each baby name from the raw data table - and then save the DataFrame into a silver table. Then you filter the data to only include data for 2021, group the data at the state level, and then sort the data by count. Finally, you save this DataFrame into a gold table and visualize the data in a bar chart. For more information on silver and gold tables, see medallion architecture.
Important
This get started article builds on Get started: Ingest and insert additional data. You must complete the steps in that article to complete this article. For the complete notebook for that getting started article, see Ingest additional data notebooks.
Requirements
To complete the tasks in this article, you must meet the following requirements:
Your workspace must have Unity Catalog enabled. For information on getting started with Unity Catalog, see Set up and manage Unity Catalog.
You must have permission to use an existing compute resource or create a new compute resource. See Get started: Account and workspace setup or see your Databricks administrator.
Tip
For a completed notebook for this article, see Cleanse and enhance data notebooks.
Step 1: Create a new notebook
To create a notebook in your workspace, click New in the sidebar, and then click Notebook. A blank notebook opens in the workspace.
To learn more about creating and managing notebooks, see Manage notebooks.
Step 2: Define variables
In this step, you define variables for use in the example notebook you create in this article.
Copy and paste the following code into the new empty notebook cell. Replace
<catalog-name>
,<schema-name>
, and<volume-name>
with the catalog, schema, and volume names for a Unity Catalog volume. Replace<table_name>
with a table name of your choice. You will save the baby name data into this table later in this article.Press
Shift+Enter
to run the cell and create a new blank cell.catalog = "<catalog_name>" schema = "<schema_name>" table_name = "baby_names" silver_table_name = "baby_names_prepared" gold_table_name = "top_baby_names_2021" path_table = catalog + "." + schema print(path_table) # Show the complete path
val catalog = "<catalog_name>" val schema = "<schema_name>" val tableName = "baby_names" val silverTableName = "baby_names_prepared" val goldTableName = "top_baby_names_2021" val pathTable = s"${catalog}.${schema}" print(pathTable) // Show the complete path
catalog <- "<catalog_name>" schema <- "<schema_name>" volume <- "<volume_name>" table_name <- "baby_names" silver_table_name <- "baby_names_prepared" gold_table_name <- "top_baby_names_2021" path_table <- paste(catalog, ".", schema, sep = "") print(path_table) # Show the complete path
Step 3: Load the raw data into a new DataFrame
This step loads the raw data previously saved into a Delta table into a new DataFrame in preparation for cleansing and enhancing this data for further analysis.
Copy and paste the following code into the new empty notebook cell.
df_raw = spark.read.table(f"{path_table}.{table_name}") display(df_raw)
val dfRaw = spark.read.table(s"${pathTable}.${tableName}") display(dfRaw)
# Load the SparkR package that is already preinstalled on the cluster. library(SparkR) df_raw = sql(paste0("SELECT * FROM ", path_table, ".", table_name)) display(df_raw)
Press
Shift+Enter
to run the cell and then move to the next cell.
Step 4: Cleanse and enhance raw data and save
In this step, you change the name of the Year
column, change the data in the First_Name
column to initial capitals, and update the values for the sex to spell of the sex, and then save the DataFrame to a new table.
Copy and paste the following code into an empty notebook cell.
from pyspark.sql.functions import col, initcap, when # Rename "Year" column to "Year_Of_Birth" df_rename_year = df_raw.withColumnRenamed("Year", "Year_Of_Birth") # Change the case of "First_Name" column to initcap df_init_caps = df_rename_year.withColumn("First_Name", initcap(col("First_Name").cast("string"))) # Update column values from "M" to "male" and "F" to "female" df_baby_names_sex = df_init_caps.withColumn( "Sex", when(col("Sex") == "M", "Male") .when(col("Sex") == "F", "Female") ) # display display(df_baby_names_sex) # Save DataFrame to table df_baby_names_sex.write.mode("overwrite").saveAsTable(f"{path_table}.{silver_table_name}")
import org.apache.spark.sql.functions.{col, initcap, when} // Rename "Year" column to "Year_Of_Birth" val dfRenameYear = dfRaw.withColumnRenamed("Year", "Year_Of_Birth") // Change the case of "First_Name" data to initial caps val dfNameInitCaps = dfRenameYear.withColumn("First_Name", initcap(col("First_Name").cast("string"))) // Update column values from "M" to "Male" and "F" to "Female" val dfBabyNamesSex = dfNameInitCaps.withColumn("Sex", when(col("Sex") equalTo "M", "Male") .when(col("Sex") equalTo "F", "Female")) // Display the data display(dfBabyNamesSex) // Save DataFrame to a table dfBabyNamesSex.write.mode("overwrite").saveAsTable(s"${pathTable}.${silverTableName}")
# Rename "Year" column to "Year_Of_Birth" df_rename_year <- withColumnRenamed(df_raw, "Year", "Year_Of_Birth") # Change the case of "First_Name" data to initial caps df_init_caps <- withColumn(df_rename_year, "First_Name", initcap(df_rename_year$First_Name)) # Update column values from "M" to "Male" and "F" to "Female" df_baby_names_sex <- withColumn(df_init_caps, "Sex", ifelse(df_init_caps$Sex == "M", "Male", ifelse(df_init_caps$Sex == "F", "Female", df_init_caps$Sex))) # Display the data display(df_baby_names_sex) # Save DataFrame to a table saveAsTable(df_baby_names_sex, paste(path_table, ".", silver_table_name), mode = "overwrite")
Press
Shift+Enter
to run the cell and then move to the next cell.
Step 5: Group and visualize data
In this step, you filter the data to only the year 2021, group the data by sex and name, aggregate by count, and order by count. You then save the DataFrame to a table and then visualize the data in a bar chart.
Copy and paste the following code into an empty notebook cell.
from pyspark.sql.functions import expr, sum, desc from pyspark.sql import Window # Count of names for entire state of New York by sex df_baby_names_2021_grouped=(df_baby_names_sex .filter(expr("Year_Of_Birth == 2021")) .groupBy("Sex", "First_Name") .agg(sum("Count").alias("Total_Count")) .sort(desc("Total_Count"))) # Display data display(df_baby_names_2021_grouped) # Save DataFrame to a table df_baby_names_2021_grouped.write.mode("overwrite").saveAsTable(f"{path_table}.{gold_table_name}")
import org.apache.spark.sql.functions.{expr, sum, desc} import org.apache.spark.sql.expressions.Window // Count of male and female names for entire state of New York by sex val dfBabyNames2021Grouped = dfBabyNamesSex .filter(expr("Year_Of_Birth == 2021")) .groupBy("Sex", "First_Name") .agg(sum("Count").alias("Total_Count")) .sort(desc("Total_Count")) // Display data display(dfBabyNames2021Grouped) // Save DataFrame to a table dfBabyNames2021Grouped.write.mode("overwrite").saveAsTable(s"${pathTable}.${goldTableName}")
# Filter to only 2021 data df_baby_names_2021 <- filter(df_baby_names_sex, df_baby_names_sex$Year_Of_Birth == 2021) # Count of names for entire state of New York by sex df_baby_names_grouped <- agg( groupBy(df_baby_names_2021, df_baby_names_2021$Sex, df_baby_names_2021$First_Name), Total_Count = sum(df_baby_names_2021$Count) ) # Display data display(arrange(select(df_baby_names_grouped, df_baby_names_grouped$Sex, df_baby_names_grouped$First_Name, df_baby_names_grouped$Total_Count), desc(df_baby_names_grouped$Total_Count))) # Save DataFrame to a table saveAsTable(df_baby_names_2021_grouped, paste(path_table, ".", gold_table_name), mode = "overwrite")
Press
Ctrl+Enter
to run the cell.Next to the Table tab, click + and then click Visualization.
In the visualization editor, click Visualization Type, and verify that Bar is selected.
In the X column, select`First_Name`.
Click Add column under Y columns and then select Total_Count.
In Group by, select Sex.
Click Save.