Migrate a Parquet data lake to Delta Lake

This article provides recommendations for converting an existing Parquet data lake to Delta Lake. Delta Lake is the underlying format in the Databricks lakehouse. See What is Delta Lake?.

Considerations before converting to Delta Lake

Your Parquet data lake likely has a partitioning strategy that has been optimized for your existing workloads and systems. While you can convert to Delta Lake and maintain this partitioning structure, over-partitioned tables are one of the main culprits that cause slow workloads on Delta Lake. See When to partition tables on Databricks and guidelines for adapting Spark code to Databricks.

You also need to consider whether or not the data being converted is still growing, as well as how frequently data is currently being queried. You might choose different approaches for different Parquet tables in your data lake.

Approaches to Delta Lake conversion

The following matrix outlines the four main approaches to converting a Parquet data lake to Delta Lake and some of the trade-offs. To clarify each column:

  • Incremental: Denotes functionality that supports converting additional data appended to the conversion source after conversion has begun.

  • Duplicates data: Indicates whether data is written to a new location or modified in place.

  • Maintains data structure: Indicates whether the partitioning strategy is maintained during conversion.

  • Backfill data: Denotes functionality that supports backfilling data that has been added to the conversion source after conversion has begun.

  • Ease of use: Indicates the level of user effort to configure and run the data conversion.

Method                 | Incremental  | Duplicates data | Maintains data structure | Backfill data | Ease of use
Deep CLONE Parquet     | Yes          | Yes             | Yes                      | Yes           | Easy
Shallow CLONE Parquet  | Yes          | No              | Yes                      | Yes           | Easy
CONVERT TO DELTA       | No           | No              | Yes                      | No            | Easy
Auto Loader            | Yes          | Yes             | No                       | Optional      | Some configuration
Batch Spark job        | Custom logic | Yes             | No                       | Custom logic  | Custom logic

The following sections discuss each of these options in greater depth.

Migrate Parquet data with CLONE Parquet

You can use CLONE Parquet to incrementally copy data from a Parquet data lake to Delta Lake. Shallow clones create pointers to existing Parquet files, maintaining your Parquet table in its original location and format while providing optimized access through collected file statistics. You can write to the table created by a shallow clone without impacting the original data source.

Deep clone copies all data files from the source to a new location while converting to Delta Lake. Deep clone allows you to incrementally detect new files, including backfill operations, on subsequent execution of the logic. See Incrementally clone Parquet and Iceberg tables to Delta Lake.

The following example demonstrates using CLONE:

CREATE OR REPLACE TABLE <target-table-name> [SHALLOW] CLONE parquet.`/path/to/data`;
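
For example, with a hypothetical target table name of sales_delta, the deep and shallow variants look like the following:

-- Deep clone: copies the Parquet data files to the new Delta table's location.
CREATE OR REPLACE TABLE sales_delta CLONE parquet.`/path/to/data`;

-- Shallow clone: creates a Delta table that references the existing Parquet files in place.
CREATE OR REPLACE TABLE sales_delta SHALLOW CLONE parquet.`/path/to/data`;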

Migrate Parquet data with CONVERT TO DELTA

You can use CONVERT TO DELTA to transform a directory of Parquet files into a Delta table with a single command. Once you have converted a table to Delta Lake, you should stop reading and writing from the table using Parquet logic. Data written to the target directory after conversion has started might not be reflected in the resultant Delta table. See Convert to Delta Lake.

The following example demonstrates using CONVERT TO DELTA:

CONVERT TO DELTA parquet.`gs://my-bucket/parquet-data`;
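
If the source directory uses Hive-style partitioning, the partition columns must be declared in the statement. A minimal sketch, assuming a hypothetical partition column named date of type DATE:

-- Declare the partition columns so the converted Delta table preserves the layout.
CONVERT TO DELTA parquet.`gs://my-bucket/parquet-data` PARTITIONED BY (date DATE);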

Migrate Parquet data with Auto Loader

While Auto Loader is designed for incremental data ingestion from cloud object storage, you can use it to implement a pattern that incrementally copies all data from a given directory to a target table. See What is Auto Loader?.

The following code example includes configurations that:

  • Process all existing files in the source directory.

  • Trigger an automatic weekly backfill job to capture files that might have been missed.

  • Allow Apache Spark to use many Spark jobs to avoid spill and out-of-memory errors associated with large data partitions.

  • Provide end-to-end exactly-once processing guarantees.

# file_path, checkpoint_path, and table_name are placeholders defined elsewhere.
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "parquet")
  # Process files that already exist in the source directory.
  .option("cloudFiles.includeExistingFiles", "true")
  # Trigger asynchronous backfills weekly to capture any missed files.
  .option("cloudFiles.backfillInterval", "1 week")
  .option("cloudFiles.schemaLocation", checkpoint_path)
  .load(file_path)
  .writeStream
  .option("checkpointLocation", checkpoint_path)
  # Process all data available now, then stop the stream.
  .trigger(availableNow=True)
  .toTable(table_name)
)

You can also use Auto Loader in Delta Live Tables with either Python or SQL. In Delta Live Tables, you do not need to provide a schema or checkpoint location, because these settings are managed automatically for your pipelines.
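
The following is a minimal Python sketch of an Auto Loader read defined as a Delta Live Tables table; the function (table) name and source path are hypothetical placeholders:

import dlt

@dlt.table
def parquet_source_bronze():
  # Incrementally ingest Parquet files from the source directory with Auto Loader.
  # Delta Live Tables manages the checkpoint and schema locations automatically.
  return (
    spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "parquet")
      .option("cloudFiles.includeExistingFiles", "true")
      .load("/path/to/parquet/source")  # hypothetical source path
  )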

Migrate Parquet data with custom Apache Spark batch logic

Writing custom Apache Spark logic provides great flexibility in controlling how and when different data from your source system is migrated, but might require extensive configuration to provide capabilities built into other approaches.

At the heart of this approach is a simple Apache Spark read and write operation, such as the following:

spark.read.format("parquet").load(file_path).write.mode("append").saveAsTable(table_name)

To perform backfills or incremental migration, you might be able to rely on the partitioning structure of your data source, but might also need to write custom logic to track which files have been added since you last loaded data from the source. While you can use Delta Lake merge capabilities to avoid writing duplicate records, comparing all records from a large Parquet source table to the contents of a large Delta table is a computationally expensive task.
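
As one sketch of an incremental approach, assuming the source uses Hive-style partitioning on a hypothetical ingest_date column, you could copy one partition at a time and record which partitions have already been migrated:

# Minimal sketch: migrate a single source partition by appending it to the target table.
# file_path and table_name are the same placeholders used in the example above;
# the ingest_date partition column is a hypothetical example.
partition_to_load = "2024-01-01"

(spark.read.format("parquet")
  .load(file_path)
  .where(f"ingest_date = '{partition_to_load}'")
  .write.mode("append")
  .saveAsTable(table_name)
)

Tracking which partitions have been migrated, for example in a small control table, avoids the expensive full-table comparison that a merge-based approach would require.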

When shouldn’t you convert to Delta Lake?

Before converting all of your existing Parquet data to Delta Lake, you should consider the potential trade-offs.

Databricks designs many optimized features of the lakehouse around Delta Lake, and Delta Lake provides a rich open source ecosystem with native connectors for many languages and enterprise data systems. Delta Sharing extends the ability to share data stored with Delta Lake to other clients.

Delta Lake is built on top of Parquet, and as such, Databricks also has optimized readers and writers for interacting with Parquet files.

Databricks recommends using Delta Lake for all tables that receive regular updates or queries from Databricks. You might choose to maintain data in Parquet format in some cases, such as the following:

  • An upstream system that writes data to Parquet does not support native writing to Delta Lake.

  • A downstream system that reads Parquet data cannot read Delta Lake.

In both of these cases, you might want to replicate your tables to Delta Lake to leverage performance benefits while reading, writing, updating, and deleting records in the table.

Note

Simultaneously modifying data in the same Delta table stored in S3 from multiple workspaces or data systems is not recommended. See Delta Lake limitations on S3.