Introduction to Delta Lake

One of the key constraints on making useful data-driven decisions is the structure, accessibility, and quality of the underlying data stores. For this reason, it is important to have a well-planned data access strategy for all end users.

One aspect of that strategy can be the data storage format provided by Delta Lake.

What is Delta Lake?

Delta Lake is a key component of the Databricks lakehouse architecture. The Delta table format is a widely used standard for enterprise data lakes at massive scale. Built on the foundation of another open source format (Parquet), Delta Lake adds advanced features and capabilities that provide additional robustness, speed, versioning, and data-warehouse-like ACID compliance, on top of the cost benefits of inexpensive cloud object storage.
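
As a rough illustration, the following PySpark sketch writes and reads a Delta table. It assumes a Spark environment with Delta Lake available (as on Databricks); the path /tmp/delta/demo is a hypothetical example.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Write a DataFrame as a Delta table. The data files are Parquet,
    # with a transaction log adding ACID guarantees and versioning.
    df = spark.range(0, 5)
    df.write.format("delta").mode("overwrite").save("/tmp/delta/demo")

    # Read the current table, or an earlier version ("time travel").
    current = spark.read.format("delta").load("/tmp/delta/demo")
    version_zero = (spark.read.format("delta")
                    .option("versionAsOf", 0)
                    .load("/tmp/delta/demo"))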

Databricks has built-in support for Delta Lake, and the latest Databricks Runtimes include optimizations that further improve performance.

See this presentation for a full discussion of Delta Lake and its capabilities: Making Apache Spark better with Delta Lake.

Data pipelines using Delta Lake and Delta Live Tables

When factored into your overall data strategy, data pipelines built on Delta Lake should follow a tiered, multi-hop pattern: successive stages of data cleaning and transformation that move data from raw ingestion (bronze) through semi-processed tables (silver) to the most refined, business-ready tables (gold), as sketched in the example below.
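
The following PySpark sketch shows one hypothetical version of this multi-hop flow. The source data, paths, and column names (event_id, event_type) are illustrative assumptions, not part of any specific dataset.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Bronze: ingest raw events and store them as-is in Delta format.
    bronze = spark.read.json("/data/raw/events")
    bronze.write.format("delta").mode("overwrite").save("/data/bronze/events")

    # Silver: deduplicate and filter out malformed records.
    silver = (spark.read.format("delta").load("/data/bronze/events")
              .dropDuplicates(["event_id"])
              .filter(F.col("event_type").isNotNull()))
    silver.write.format("delta").mode("overwrite").save("/data/silver/events")

    # Gold: aggregate into a business-ready summary table.
    gold = silver.groupBy("event_type").agg(F.count("*").alias("event_count"))
    gold.write.format("delta").mode("overwrite").save("/data/gold/event_counts")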

You can view a more thorough examination of this approach in this presentation: Simplify and Scale Data Engineering Pipelines.

Databricks also includes Delta Live Tables, a powerful framework for building reliable, maintainable, and testable data processing pipelines. You define the transformations to perform on your data, and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling.

Instead of defining your data pipelines using a series of separate Apache Spark tasks, Delta Live Tables manages how your data is transformed based on a target schema you define for each processing step.
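
A minimal Delta Live Tables sketch in Python is shown below. Each decorated function declares a target dataset, and the framework handles orchestration, dependencies, and data quality; the dlt module is only available inside a Delta Live Tables pipeline on Databricks, and the source path, table names, and expectation are illustrative assumptions.

    import dlt
    from pyspark.sql import functions as F

    @dlt.table(comment="Raw events ingested from cloud storage (bronze).")
    def events_bronze():
        # Hypothetical source location.
        return spark.read.format("json").load("/data/raw/events")

    @dlt.expect_or_drop("valid_event_type", "event_type IS NOT NULL")
    @dlt.table(comment="Cleaned, deduplicated events (silver).")
    def events_silver():
        return dlt.read("events_bronze").dropDuplicates(["event_id"])

    @dlt.table(comment="Event counts by type (gold).")
    def event_counts_gold():
        return (dlt.read("events_silver")
                .groupBy("event_type")
                .agg(F.count("*").alias("event_count")))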

For an introduction, see Quickstart: Create data pipelines with Delta Live Tables.

Overview and quickstarts

To get started with Delta Lake and Delta Live Tables, see: