What is the Databricks Lakehouse?

The Databricks Lakehouse combines the ACID transactions and data governance of data warehouses with the flexibility and cost-efficiency of data lakes to enable business intelligence (BI) and machine learning (ML) on all data. The Databricks Lakehouse keeps your data in your massively scalable cloud object storage in open source data standards, allowing you to use your data however and wherever you want.

Components of the Databricks Lakehouse

The primary storage component of the Databricks Lakehouse is Delta Lake, an open source storage layer that brings ACID transactions to data stored in cloud object storage.

By storing data with Delta Lake, you enable downstream data scientists, analysts, and machine learning engineers to work with the same production data that supports your core ETL workloads as soon as that data is processed.

Delta tables

Tables created on Databricks use the Delta Lake protocol by default. When you create a new Delta table:

  • Metadata used to reference the table is added to the metastore in the declared schema or database.

  • Data and table metadata are saved to a directory in cloud object storage.
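The two effects above can be modeled in a short stdlib-only Python sketch. This is not a Databricks API; the `metastore` dict and `create_table` helper are illustrative stand-ins for the metastore entry and the cloud-storage directory that table creation produces.

```python
from pathlib import Path
import tempfile

# Toy model of the two effects of creating a Delta table (illustrative only,
# not Databricks APIs): a metastore entry pointing at a storage location, and
# the table directory itself in (simulated) object storage.
storage_root = Path(tempfile.mkdtemp())
metastore = {}  # maps "schema.table" -> storage path

def create_table(schema: str, table: str) -> Path:
    """Register the table in the metastore and create its storage directory."""
    path = storage_root / schema / table
    (path / "_delta_log").mkdir(parents=True)  # transaction log lives here
    metastore[f"{schema}.{table}"] = str(path)
    return path

create_table("analytics", "events")
print("analytics.events" in metastore)  # -> True
```

In the real system the metastore entry carries the schema-qualified name while the data and log files live at the storage path, which is why the metastore reference can be optional, as described below.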

The metastore reference to a Delta table is technically optional; you can create Delta tables by directly interacting with directory paths using Spark APIs. Some new features that build upon Delta Lake will store additional metadata in the table directory, but all Delta tables have:

  • A directory containing table data in the Parquet file format.

  • A sub-directory /_delta_log that contains metadata about table versions in JSON and Parquet format.
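The layout above can be sketched with a miniature on-disk example. Delta Lake names each commit file in `/_delta_log` after its version number, zero-padded to 20 digits, so the table's version history can be recovered from the log file names alone. The commit contents here are simplified placeholders, not the full Delta protocol actions.

```python
import json
import tempfile
from pathlib import Path

# Miniature of a Delta table directory: Parquet data files sit at the table
# root, and each committed version adds a JSON file named <version>.json
# (zero-padded to 20 digits) under /_delta_log.
table = Path(tempfile.mkdtemp()) / "events"
log = table / "_delta_log"
log.mkdir(parents=True)

# Two toy commits: version 0 creates the table, version 1 appends a file.
# Real commit files contain richer protocol actions than this placeholder.
(log / f"{0:020d}.json").write_text(
    json.dumps({"add": {"path": "part-00000.snappy.parquet"}}))
(log / f"{1:020d}.json").write_text(
    json.dumps({"add": {"path": "part-00001.snappy.parquet"}}))

# Recover the table's version history from the log file names alone.
versions = sorted(int(p.stem) for p in log.glob("*.json"))
print(versions)  # -> [0, 1]
```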


Data lakehouse vs. data warehouse vs. data lake

Data warehouses have powered business intelligence (BI) decisions for about 30 years, having evolved as a set of design guidelines for systems controlling the flow of data. Data warehouses optimize queries for BI reports, but can take minutes or even hours to generate results. Designed for data that is unlikely to change with high frequency, data warehouses seek to prevent conflicts between concurrently running queries. Many data warehouses rely on proprietary formats, which often limit support for machine learning.

Powered by technological advances in data storage and driven by exponential increases in the types and volume of data, data lakes have come into widespread use over the last decade. Data lakes store and process data cheaply and efficiently. Data lakes are often defined in opposition to data warehouses: a data warehouse delivers clean, structured data for BI analytics, while a data lake permanently and cheaply stores data of any nature in any format. Many organizations use data lakes for data science and machine learning, but not for BI reporting, because the data in a lake is unvalidated.

The data lakehouse replaces the dual dependency on separate data lakes and data warehouses for modern data companies that need:

  • Open, direct access to data stored in standard data formats.

  • Indexing protocols optimized for machine learning and data science.

  • Low query latency and high reliability for BI and advanced analytics.

By combining an optimized metadata layer with validated data stored in standard formats in cloud object storage, the data lakehouse allows data scientists and ML engineers to build models from the same data driving BI reports.