Migrate your data warehouse to the Databricks lakehouse

This article describes some of the considerations and caveats to keep in mind as you replace your enterprise data warehouse with the Databricks lakehouse. Most workloads, queries, and dashboards defined in enterprise data warehouses can run with minimal code refactoring once admins have completed the initial data migration and governance configuration. Migrating your data warehousing workloads to Databricks is not about eliminating data warehousing, but rather unifying your data ecosystem. For more on data warehousing on Databricks, see What is data warehousing on Databricks?.

Many Apache Spark workloads extract, transform, and load (ETL) data from source systems into data warehouses to power downstream analytics. Replacing your enterprise data warehouse with a lakehouse enables analysts, data scientists, and data engineers to work against the same tables in the same platform, reducing the overall complexity, maintenance requirements, and total cost of ownership. See What is a data lakehouse?.
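
For example, the following is a minimal sketch of such an ETL job on Databricks. The JDBC connection details and the main.sales.orders target table are hypothetical placeholders for your own source system and catalog, not a prescribed layout.

```python
# Minimal ETL sketch: extract from a hypothetical JDBC source, apply a light
# transformation, and load the result as a Delta table in the lakehouse.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available as `spark` on Databricks

# Extract: read a table from an example source system over JDBC (placeholder values).
raw_orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-host:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "reader")
    .option("password", "<secret>")
    .load()
)

# Transform: light cleanup before publishing.
clean_orders = raw_orders.dropDuplicates(["order_id"]).filter("order_total >= 0")

# Load: write a managed Delta table that analysts, data scientists, and
# data engineers can all query from the same platform.
clean_orders.write.mode("overwrite").saveAsTable("main.sales.orders")
```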

Load data into the lakehouse

Databricks provides a number of tools and capabilities to make it easy to migrate data to the lakehouse and configure ETL jobs to load data from diverse data sources. The following articles introduce these tools and options:
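
As a quick illustration of one such option, the following sketch uses Auto Loader to incrementally ingest files from cloud storage into a Delta table; the volume paths, file format, and target table name are hypothetical placeholders.

```python
# Minimal Auto Loader sketch: incrementally load newly arriving JSON files
# from a landing location into a Unity Catalog table (placeholder paths and names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

(
    spark.readStream.format("cloudFiles")            # Auto Loader source
    .option("cloudFiles.format", "json")             # format of the incoming files
    .option("cloudFiles.schemaLocation", "/Volumes/main/landing/_schemas/events")
    .load("/Volumes/main/landing/events")            # location to monitor for new files
    .writeStream
    .option("checkpointLocation", "/Volumes/main/landing/_checkpoints/events")
    .trigger(availableNow=True)                      # process available files, then stop
    .toTable("main.analytics.raw_events")            # target Delta table
)
```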

How is the Databricks Data Intelligence Platform different from an enterprise data warehouse?

The Databricks Data Intelligence Platform is built on top of Apache Spark, Unity Catalog, and Delta Lake, providing native support for big data workloads for analytics, ML, and data engineering. All enterprise data systems have slightly different transactional guarantees, indexing and optimization patterns, and SQL syntax. Some of the biggest differences you might discover include the following:

  • All transactions are table-level. There are no database-level transactions, locks, or guarantees.

  • There are no BEGIN and END constructs, meaning each statement or query runs as a separate transaction.

  • Three-tier namespacing uses the catalog.schema.table pattern. The terms database and schema are synonymous due to legacy Apache Spark syntax (see the sketch after this list).

  • Primary key and foreign key constraints are informational only. Constraints can be enforced only at the table level. See Constraints on Databricks.

  • Native data types supported in Databricks and Delta Lake might differ slightly from those in source systems. Determine the required precision for numeric types before choosing target types.
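
The following sketch illustrates several of these behaviors using hypothetical catalog, schema, and table names. Each spark.sql() call commits as its own table-level transaction; there is no BEGIN/END block wrapping the statements, and the primary key constraint is recorded but not enforced.

```python
# Sketch of the lakehouse DDL/DML behaviors described above (hypothetical names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Three-tier namespacing: catalog.schema.table (CREATE CATALOG requires metastore privileges).
spark.sql("CREATE CATALOG IF NOT EXISTS main")
spark.sql("CREATE SCHEMA IF NOT EXISTS main.sales")

# Declare numeric precision and scale explicitly to match the source system.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.orders (
        order_id BIGINT NOT NULL,
        order_total DECIMAL(18, 2),
        CONSTRAINT orders_pk PRIMARY KEY (order_id)  -- informational only, not enforced
    )
""")

# Each statement below runs as a separate transaction; there is no multi-statement
# transaction that could roll both inserts back together.
spark.sql("INSERT INTO main.sales.orders VALUES (1, 19.99)")
spark.sql("INSERT INTO main.sales.orders VALUES (2, 5.00)")
```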

The following articles provide additional context on important considerations: