Databricks for Python developers
This section provides a guide to developing notebooks and jobs in Databricks using the Python language, including tutorials for common workflows and tasks, and links to APIs, libraries, and tools.
To get started:
Import code: Either import your own code from files or Git repos or try a tutorial listed below. Databricks recommends learning using interactive Databricks Notebooks.
Run your code on a cluster: Either create a cluster of your own, or ensure you have permissions to use a shared cluster. Attach your notebook to the cluster, and run the notebook.
Then you can:
Work with larger data sets using Apache Spark
Automate your workload as a job
Use machine learning to analyze your data
Tutorials
The below tutorials provide example code and notebooks to learn about common workflows. See Import a notebook for instructions on importing notebook examples into your workspace.
Data engineering
Tutorial: Load and transform data using Apache Spark DataFrames provides a walkthrough to help you learn about Apache Spark DataFrames for data preparation and analytics.
Data science and machine learning
Getting started with Apache Spark DataFrames for data preparation and analytics: Tutorial: Load and transform data using Apache Spark DataFrames
Tutorial: End-to-end ML models on Databricks. For additional examples, see Tutorials: Get started with AI and machine learning and the MLflow guide’s Quickstart Python.
AutoML lets you get started quickly with developing machine learning models on your own datasets. Its glass-box approach generates notebooks with the complete machine learning workflow, which you may clone, modify, and rerun.
Debug in Python notebooks
The example notebook illustrates how to use the Python debugger (pdb) in Databricks notebooks. To use the Python debugger, you must be running Databricks Runtime 11.3 LTS or above.
With Databricks Runtime 12.2 LTS and above, you can use variable explorer to track the current value of Python variables in the notebook UI. You can use variable explorer to observe the values of Python variables as you step through breakpoints.
Note
breakpoint()
is not supported in IPython and thus does not work in Databricks notebooks. You can use import pdb; pdb.set_trace()
instead of breakpoint()
.
Python APIs
Python code that runs outside of Databricks can generally run within Databricks, and vice versa. If you have existing code, just import it into Databricks to get started. See Manage code with notebooks and Databricks Git folders below for details.
Databricks can run both single-machine and distributed Python workloads. For single-machine computing, you can use Python APIs and libraries as usual; for example, pandas and scikit-learn will “just work.” For distributed Python workloads, Databricks offers two popular APIs out of the box: PySpark and Pandas API on Spark.
PySpark API
PySpark is the official Python API for Apache Spark and combines the power of Python and Apache Spark. PySpark is more flexibility than the Pandas API on Spark and provides extensive support and features for data science and engineering functionality such as Spark SQL, Structured Streaming, MLLib, and GraphX.
Pandas API on Spark
Note
The Koalas open-source project now recommends switching to the Pandas API on Spark. The Pandas API on Spark is available on clusters that run Databricks Runtime 10.0 (EoS) and above. For clusters that run Databricks Runtime 9.1 LTS and below, use Koalas instead.
pandas is a Python package commonly used by data scientists for data analysis and manipulation. However, pandas does not scale out to big data. Pandas API on Spark fills this gap by providing pandas-equivalent APIs that work on Apache Spark. This open-source API is an ideal choice for data scientists who are familiar with pandas but not Apache Spark.
Manage code with notebooks and Databricks Git folders
Databricks notebooks support Python. These notebooks provide functionality similar to that of Jupyter, but with additions such as built-in visualizations using big data, Apache Spark integrations for debugging and performance monitoring, and MLflow integrations for tracking machine learning experiments. Get started by importing a notebook. Once you have access to a cluster, you can attach a notebook to the cluster and run the notebook.
Tip
To completely reset the state of your notebook, it can be useful to restart the iPython kernel. For Jupyter users, the “restart kernel” option in Jupyter corresponds to detaching and re-attaching a notebook in Databricks. To restart the kernel in a Python notebook, click the compute selector in the notebook toolbar and hover over the attached cluster or SQL warehouse in the list to display a side menu. Select Detach & re-attach. This detaches the notebook from your cluster and reattaches it, which restarts the Python process.
Databricks Git folders allow users to synchronize notebooks and other files with Git repositories. Databricks Git folders help with code versioning and collaboration, and it can simplify importing a full repository of code into Databricks, viewing past notebook versions, and integrating with IDE development. Get started by cloning a remote Git repository. You can then open or create notebooks with the repository clone, attach the notebook to a cluster, and run the notebook.
Clusters and libraries
Databricks compute provides compute management for clusters of any size: from single node clusters up to large clusters. You can customize cluster hardware and libraries according to your needs. Data scientists will generally begin work either by creating a cluster or using an existing shared cluster. Once you have access to a cluster, you can attach a notebook to the cluster or run a job on the cluster.
For small workloads which only require single nodes, data scientists can use single node compute for cost savings.
For detailed tips, see Compute configuration recommendations
Administrators can set up cluster policies to simplify and guide cluster creation.
Databricks clusters use a Databricks Runtime, which provides many popular libraries out-of-the-box, including Apache Spark, Delta Lake, pandas, and more. You can also install additional third-party or custom Python libraries to use with notebooks and jobs.
Start with the default libraries in the Databricks Runtime release notes versions and compatibility. Use Databricks Runtime for Machine Learning for machine learning workloads. For full lists of pre-installed libraries, see Databricks Runtime release notes versions and compatibility.
Customize your environment using Notebook-scoped Python libraries, which allow you to modify your notebook or job environment with libraries from PyPI or other repositories. The
%pip install my_library
magic command installsmy_library
to all nodes in your currently attached cluster, yet does not interfere with other workloads on shared clusters.Install non-Python libraries as Cluster libraries as needed.
For more details, see Libraries.
Visualizations
Databricks Python notebooks have built-in support for many types of visualizations. You can also use legacy visualizations.
You can also visualize data using third-party libraries; some are pre-installed in the Databricks Runtime, but you can install custom libraries as well. Popular options include:
Jobs
You can automate Python workloads as scheduled or triggered jobs in Databricks. Jobs can run notebooks, Python scripts, and Python wheel files.
Create and update jobs using the Databricks UI or the Databricks REST API.
The Databricks Python SDK allows you to create, edit, and delete jobs programmatically.
The Databricks CLI provides a convenient command line interface for automating jobs.
Tip
To schedule a Python script instead of a notebook, use the spark_python_task
field under tasks
in the body of a create job request.
Machine learning
Databricks supports a wide variety of machine learning (ML) workloads, including traditional ML on tabular data, deep learning for computer vision and natural language processing, recommendation systems, graph analytics, and more. For general information about machine learning on Databricks, see AI and machine learning on Databricks.
For ML algorithms, you can use pre-installed libraries in Databricks Runtime for Machine Learning, which includes popular Python tools such as scikit-learn, TensorFlow, Keras, PyTorch, Apache Spark MLlib, and XGBoost. You can also install custom libraries.
For machine learning operations (MLOps), Databricks provides a managed service for the open source library MLflow. With MLflow Tracking you can record model development and save models in reusable formats. You can use the MLflow Model Registry to manage and automate the promotion of models towards production. Jobs allows hosting models as batch and streaming jobs. For more information and examples, see the ML lifecycle management using MLflow or the MLflow Python API docs.
To get started with common machine learning workloads, see the following pages:
Training scikit-learn and tracking with MLflow: 10-minute tutorial: machine learning on Databricks with scikit-learn
Training deep learning models: Deep learning
Hyperparameter tuning: Parallelize Hyperopt hyperparameter tuning
Graph analytics: How to use GraphFrames on Databricks
IDEs, developer tools, and SDKs
In addition to developing Python code within Databricks notebooks, you can develop externally using integrated development environments (IDEs) such as PyCharm, Jupyter, and Visual Studio Code. To synchronize work between external development environments and Databricks, there are several options:
Code: You can synchronize code using Git. See Git integration for Databricks Git folders.
Libraries and Jobs: You can create libraries (such as Python wheel files) externally and upload them to Databricks. Those libraries may be imported within Databricks notebooks, or they can be used to create jobs. See Libraries and Schedule and orchestrate workflows.
Remote machine execution: You can run code from your local IDE for interactive development and testing. The IDE can communicate with Databricks to execute Apache Spark and large computations on Databricks clusters. See Databricks Connect.
Databricks provides a set of SDKs, including a Python SDK, that support automation and integration with external tooling. You can use the Databricks SDKs to manage resources like clusters and libraries, code and other workspace objects, workloads and jobs, and more. See the Databricks SDKs.
For more information on IDEs, developer tools, and SDKs, see Developer tools.
Additional resources
The Databricks Academy offers self-paced and instructor-led courses on many topics.
Databricks Labs provides tools for Python development in Databricks such as the pytest plugin and the pylint plugin.
Features that support interoperability between PySpark and pandas include the following:
Python and SQL database connectivity tools include:
The Databricks SQL Connector for Python allows you to use Python code to run SQL commands on Databricks resources.
pyodbc allows you to connect from your local Python code through ODBC to data stored in the Databricks lakehouse.
FAQs and tips for moving Python workloads to Databricks can be found in the Databricks Knowledge Base