What is data transformation on Databricks?
Data transformation is the process of converting, cleansing, and structuring data into a usable format. It typically follows the Databricks medallion architecture, incrementally refining raw data into a format consumable by the business.
The following diagram shows a data pipeline containing a series of data transformations. In this example, the raw_customers dataset is turned into the clean_customers dataset by dropping customer records with no customer name. The raw_transactions data is turned into clean_transactions by dropping transactions with a zero dollar value. A resulting dataset called sales_report is the join of clean_customers and clean_transactions. Analysts can use sales_report for analytics and business intelligence.
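The three transformations above are simple row-level rules: two filters and a join. As a minimal illustration of that logic in plain Python (column names such as customer_id, customer_name, and amount are assumed for the example and may differ from the real datasets):

```python
# Minimal sketch of the pipeline's row-level logic.
# Column names (customer_id, customer_name, amount) are assumed.

raw_customers = [
    {"customer_id": 1, "customer_name": "Ada"},
    {"customer_id": 2, "customer_name": None},  # dropped: no customer name
]
raw_transactions = [
    {"customer_id": 1, "amount": 19.99},
    {"customer_id": 1, "amount": 0.0},          # dropped: zero dollar value
]

# clean_customers: drop customer data with no customer name
clean_customers = [c for c in raw_customers if c["customer_name"]]

# clean_transactions: drop transactions with a zero dollar value
clean_transactions = [t for t in raw_transactions if t["amount"] != 0]

# sales_report: join clean_customers and clean_transactions on customer_id
sales_report = [
    {**c, **t}
    for c in clean_customers
    for t in clean_transactions
    if c["customer_id"] == t["customer_id"]
]
```

On Databricks these steps would run as Spark DataFrame operations over tables rather than Python lists, but the filtering and joining semantics are the same.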
Types of data transformations
Databricks considers two types of data transformations: declarative and procedural. The data pipeline in the preceding example can be expressed using either paradigm.
Declarative transformations focus on the desired outcome rather than how to achieve it. You specify the logic of the transformation using higher-level abstractions, and Delta Live Tables determines the most efficient way to execute it.
Procedural data transformations focus on performing computations through explicit instructions. Those computations define the exact sequence of operations to manipulate the data. The procedural approach provides more control over execution but at the cost of greater complexity and higher maintenance.
Choosing between declarative and procedural data transformation
Declarative data transformation using Delta Live Tables is best when:
You require rapid development and deployment.
Your data pipelines have standard patterns that do not require low-level control over execution.
You need built-in data quality checks.
Maintenance and readability are top priorities.
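A declarative version of the example pipeline might look like the following Delta Live Tables sketch. Table and column names are assumed, and this definition runs only inside a Delta Live Tables pipeline, not as a standalone script:

```python
import dlt

# Declarative sketch: each function declares *what* the table should
# contain; Delta Live Tables plans and runs the execution, and the
# expectations provide built-in data quality checks.

@dlt.table(comment="Customers with a non-null customer name")
@dlt.expect_or_drop("has_name", "customer_name IS NOT NULL")
def clean_customers():
    return dlt.read("raw_customers")

@dlt.table(comment="Transactions with a nonzero dollar value")
@dlt.expect_or_drop("nonzero_amount", "amount != 0")
def clean_transactions():
    return dlt.read("raw_transactions")

@dlt.table(comment="Clean customers joined to clean transactions")
def sales_report():
    return dlt.read("clean_customers").join(
        dlt.read("clean_transactions"), on="customer_id"
    )
```

Note that nothing here specifies execution order or retry behavior; Delta Live Tables infers the dependency graph from the dlt.read calls.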
Procedural data transformation using Apache Spark code is best when:
You are migrating an existing Apache Spark codebase to Databricks.
You need fine-grained control over execution.
You need access to low-level APIs such as MERGE or foreachBatch.
You need to write data to Kafka or external Delta tables.
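As one sketch of the procedural style, a streaming foreachBatch handler can apply a MERGE upsert explicitly. Table and column names (clean_transactions, transaction_id, amount) and the checkpoint path are assumptions, and this requires a Spark environment with Delta Lake, such as a Databricks cluster where spark is the provided session:

```python
from delta.tables import DeltaTable

# Procedural sketch: the code spells out each step explicitly, including
# the exact MERGE condition used to upsert every micro-batch into the
# target table. Names here are illustrative assumptions.

def upsert_batch(batch_df, batch_id):
    target = DeltaTable.forName(batch_df.sparkSession, "clean_transactions")
    (
        target.alias("t")
        .merge(batch_df.alias("s"), "t.transaction_id = s.transaction_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

# Drop zero-dollar transactions, then upsert each micro-batch via MERGE.
# `spark` is the session provided by the Databricks runtime.
(
    spark.readStream.table("raw_transactions")
    .filter("amount != 0")
    .writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/tmp/checkpoints/clean_transactions")
    .trigger(availableNow=True)
    .start()
)
```

Compared with the declarative version, this code controls the merge condition, checkpointing, and trigger behavior directly, which is the fine-grained control the bullets above describe, at the cost of owning that logic yourself.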