This section provides a guide to developing notebooks and jobs in Databricks using the Python language. The first subsection provides links to tutorials for common workflows and tasks. The second subsection provides links to APIs, libraries, and key tools.
A basic workflow for getting started is:
Import code: Either import your own code from files or Git repos or try a tutorial listed below. Databricks recommends learning using interactive Databricks Notebooks.
Run your code on a cluster: Either create a cluster of your own, or ensure you have permissions to use a shared cluster. Attach your notebook to the cluster, and run the notebook.
Beyond this, you can branch out into more specific topics:
The below tutorials provide example code and notebooks to learn about common workflows. See Import a notebook for instructions on importing notebook examples into your workspace.
Getting started with Apache Spark DataFrames for data preparation and analytics: Tutorial: Load and transform data in PySpark DataFrames
Databricks AutoML lets you get started quickly with developing machine learning models on your own datasets. Its glass-box approach generates notebooks with the complete machine learning workflow, which you may clone, modify, and rerun.
Tutorial: Load and transform data in PySpark DataFrames provides a walkthrough to help you learn about Apache Spark DataFrames for data preparation and analytics.
The example notebook illustrates how to use the Python debugger (pdb) in Databricks notebooks. To use the Python debugger, you must be running Databricks Runtime 11.2 or above.
With Databricks Runtime 12.1 and above, you can use variable explorer to track the current value of Python variables in the notebook UI. You can use variable explorer to observe the values of Python variables as you step through breakpoints.
breakpoint() is not supported in IPython and thus does not work in Databricks notebooks. You can use
import pdb; pdb.set_trace() instead of
The below subsections list key features and tips to help you begin developing in Databricks with Python.
Python code that runs outside of Databricks can generally run within Databricks, and vice versa. If you have existing code, just import it into Databricks to get started. See Manage code with notebooks and Databricks Repos below for details.
Databricks can run both single-machine and distributed Python workloads. For single-machine computing, you can use Python APIs and libraries as usual; for example, pandas and scikit-learn will “just work.” For distributed Python workloads, Databricks offers two popular APIs out of the box: the Pandas API on Spark and PySpark.
The Koalas open-source project now recommends switching to the Pandas API on Spark. The Pandas API on Spark is available on clusters that run Databricks Runtime 10.0 (unsupported) and above. For clusters that run Databricks Runtime 9.1 LTS and below, use Koalas instead.
pandas is a Python package commonly used by data scientists for data analysis and manipulation. However, pandas does not scale out to big data. Pandas API on Spark fills this gap by providing pandas-equivalent APIs that work on Apache Spark. This open-source API is an ideal choice for data scientists who are familiar with pandas but not Apache Spark.
Databricks notebooks support Python. These notebooks provide functionality similar to that of Jupyter, but with additions such as built-in visualizations using big data, Apache Spark integrations for debugging and performance monitoring, and MLflow integrations for tracking machine learning experiments. Get started by importing a notebook. Once you have access to a cluster, you can attach a notebook to the cluster and run the notebook.
To completely reset the state of your notebook, it can be useful to restart the iPython kernel. For Jupyter users, the “restart kernel” option in Jupyter corresponds to detaching and re-attaching a notebook in Databricks. To restart the kernel in a Python notebook, click the compute selector in the notebook toolbar and hover over the attached cluster or SQL warehouse in the list to display a side menu. Select Detach & re-attach. This detaches the notebook from your cluster and reattaches it, which restarts the Python process.
Databricks Repos allows users to synchronize notebooks and other files with Git repositories. Databricks Repos helps with code versioning and collaboration, and it can simplify importing a full repository of code into Databricks, viewing past notebook versions, and integrating with IDE development. Get started by cloning a remote Git repository. You can then open or create notebooks with the repository clone, attach the notebook to a cluster, and run the notebook.
Databricks Compute provide compute management for clusters of any size: from single node clusters up to large clusters. You can customize cluster hardware and libraries according to your needs. Data scientists will generally begin work either by creating a cluster or using an existing shared cluster. Once you have access to a cluster, you can attach a notebook to the cluster or run a job on the cluster.
For small workloads which only require single nodes, data scientists can use Single Node clusters for cost savings.
For detailed tips, see Best practices: Cluster configuration
Administrators can set up cluster policies to simplify and guide cluster creation.
Databricks clusters use a Databricks Runtime, which provides many popular libraries out-of-the-box, including Apache Spark, Delta Lake, pandas, and more. You can also install additional third-party or custom Python libraries to use with notebooks and jobs.
Start with the default libraries in the Databricks Runtime release notes versions and compatibility. Use Databricks Runtime for Machine Learning for machine learning workloads. For full lists of pre-installed libraries, see Databricks Runtime release notes versions and compatibility.
Customize your environment using Notebook-scoped Python libraries, which allow you to modify your notebook or job environment with libraries from PyPI or other repositories. The
%pip install my_librarymagic command installs
my_libraryto all nodes in your currently attached cluster, yet does not interfere with other workloads on shared clusters.
Install non-Python libraries as Cluster libraries as needed.
For more details, see Libraries.
You can also visualize data using third-party libraries; some are pre-installed in the Databricks Runtime, but you can install custom libraries as well. Popular options include:
You can automate Python workloads as scheduled or triggered Create and run Databricks Jobs in Databricks. Jobs can run notebooks, Python scripts, and Python wheels.
For details on creating a job via the UI, see Create a job.
The Databricks SDKs allow you to create, edit, and delete jobs programmatically.
The Databricks CLI provides a convenient command line interface for automating jobs.
To schedule a Python script instead of a notebook, use the
spark_python_task field under
tasks in the body of a create job request.
Databricks supports a wide variety of machine learning (ML) workloads, including traditional ML on tabular data, deep learning for computer vision and natural language processing, recommendation systems, graph analytics, and more. For general information about machine learning on Databricks, see AI and Machine Learning on Databricks.
For ML algorithms, you can use pre-installed libraries in Databricks Runtime for Machine Learning, which includes popular Python tools such as scikit-learn, TensorFlow, Keras, PyTorch, Apache Spark MLlib, and XGBoost. You can also install custom libraries.
For machine learning operations (MLOps), Databricks provides a managed service for the open source library MLflow. MLflow Tracking lets you record model development and save models in reusable formats; the MLflow Model Registry lets you manage and automate the promotion of models towards production; and Jobs, allow hosting models as batch and streaming jobs. For more information and examples, see the MLflow guide or the MLflow Python API docs.
To get started with common machine learning workloads, see the following pages:
Training scikit-learn and tracking with MLflow: 10-minute tutorial: machine learning on Databricks with scikit-learn
Training deep learning models: Deep learning
Hyperparameter tuning: Parallelize hyperparameter tuning with scikit-learn and MLflow
Graph analytics: GraphFrames user guide - Python
In addition to developing Python code within Databricks notebooks, you can develop externally using integrated development environments (IDEs) such as PyCharm, Jupyter, and Visual Studio Code. To synchronize work between external development environments and Databricks, there are several options:
Code: You can synchronize code using Git. See Git integration with Databricks Repos.
Libraries and Jobs: You can create libraries (such as Python wheels) externally and upload them to Databricks. Those libraries may be imported within Databricks notebooks, or they can be used to create jobs. See Libraries and Create and run Databricks Jobs.
Remote machine execution: You can run code from your local IDE for interactive development and testing. The IDE can communicate with Databricks to execute Apache Spark and large computations on Databricks clusters. To learn to use Databricks Connect to create this connection, see Use IDEs with Databricks.
Databricks provides a set of SDKs which support automation and integration with external tooling. You can use the Databricks SDKs to manage resources like clusters and libraries, code and other workspace objects, workloads and jobs, and more. See the Databricks SDKs.
For more information on IDEs, developer tools, and SDKs, see Developer tools and guidance.
The Databricks Academy offers self-paced and instructor-led courses on many topics.
Features that support interoperability between PySpark and pandas
Python and SQL database connectivity
FAQs and tips for moving Python workloads to Databricks