This article provides an introduction to migrating existing data applications to Databricks. Databricks provides a unified approach that lets you work with data from many source systems on a single platform.
For an overview of platform capabilities, see What is Databricks?.
You can migrate Apache Spark jobs used to extract, transform, and load data from on-premises or cloud-native implementations to Databricks with just a few steps. See Adapt your existing Apache Spark code for Databricks.
Databricks extends the functionality of Spark SQL with pre-configured open source integrations, partner integrations, and enterprise product offerings. If your ETL workloads are written in SQL or Hive, you can migrate to Databricks with minimal refactoring. Learn more about Databricks SQL offerings:
For specific instructions on migrating from various source systems to Databricks, see Migrate ETL pipelines to Databricks.
Databricks provides optimal value and performance when workloads align around data stored in the lakehouse. Many enterprise data stacks include both a data lake and a data warehouse, and organizations create complex ETL workflows to try to keep these systems in sync with accurate and timely data. The lakehouse allows you to leverage the same data stored in the data lake in queries and systems that usually rely on a data warehouse. For more on the lakehouse, see What is the Databricks Lakehouse?.
Migrating from a data warehouse to the lakehouse generally involves reducing the complexity of your data architecture and workflows, but there are some caveats and best practices to keep in mind while completing this work. See Migrate your enterprise data warehouse to the Databricks Lakehouse.
Because the lakehouse provides optimized access to cloud-based data files through table queries or file paths, you can do ML, data science, and analytics on a single copy of your data. Databricks makes it easy to move workloads from both open source and proprietary tools, and maintains updated versions of many of the open source libraries used by analysts and data scientists.
Pandas workloads in Jupyter notebooks can be synced and run using Databricks Repos. Databricks provides native support for pandas in all Databricks Runtime versions, and configures many popular ML and deep learning libraries in the Databricks ML Runtime. If you sync your local workloads using Git and Files in Repos, you can use the same relative paths for data and custom libraries present in your local environment.
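The sketch below illustrates the relative-path point under the assumption that the working directory is the repo root, which is how a synced repo would behave; the `data/sample.csv` file name is an illustrative stand-in for data files checked into your repository.

```python
# Sketch: a relative path resolves the same way on a laptop and in a
# synced Databricks repo. The file created here stands in for data
# that would normally be committed to the repository (illustrative).
from pathlib import Path
import pandas as pd

# Create a small data file the way a repo might ship one.
Path("data").mkdir(exist_ok=True)
pd.DataFrame({"x": [1, 2, 3]}).to_csv("data/sample.csv", index=False)

# The same relative path works in both environments, so the notebook
# needs no path changes after syncing.
df = pd.read_csv("data/sample.csv")
total = int(df["x"].sum())
```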
By default, Databricks maintains the .ipynb extension for Jupyter notebooks synced with Databricks Repos, but automatically converts Jupyter notebooks to Databricks notebooks when they are imported with the UI. Databricks notebooks save with a .py extension, and so can live side-by-side with Jupyter notebooks in a Git repository.