Databricks for Python developers

This section provides a guide to developing notebooks and jobs in Databricks using the Python language, including tutorials for common workflows and tasks, and links to APIs, libraries, and tools.

To get started:

Tutorials

The below tutorials provide example code and notebooks to learn about common workflows. See Import a notebook for instructions on importing notebook examples into your workspace.

Data engineering

Data science and machine learning

Debug in Python notebooks

The example notebook illustrates how to use the Python debugger (pdb) in Databricks notebooks. To use the Python debugger, you must be running Databricks Runtime 11.3 LTS or above.

With Databricks Runtime 12.2 LTS and above, you can use variable explorer to track the current value of Python variables in the notebook UI. You can use variable explorer to observe the values of Python variables as you step through breakpoints.

Python debugger example notebook

Open notebook in new tab

Note

breakpoint() is not supported in IPython and thus does not work in Databricks notebooks. You can use import pdb; pdb.set_trace() instead of breakpoint().

Python APIs

Python code that runs outside of Databricks can generally run within Databricks, and vice versa. If you have existing code, just import it into Databricks to get started. See Manage code with notebooks and Databricks Git folders below for details.

Databricks can run both single-machine and distributed Python workloads. For single-machine computing, you can use Python APIs and libraries as usual; for example, pandas and scikit-learn will “just work.” For distributed Python workloads, Databricks offers two popular APIs out of the box: PySpark and Pandas API on Spark.

PySpark API

PySpark is the official Python API for Apache Spark and combines the power of Python and Apache Spark. PySpark is more flexibility than the Pandas API on Spark and provides extensive support and features for data science and engineering functionality such as Spark SQL, Structured Streaming, MLLib, and GraphX.

Pandas API on Spark

Note

The Koalas open-source project now recommends switching to the Pandas API on Spark. The Pandas API on Spark is available on clusters that run Databricks Runtime 10.0 (EoS) and above. For clusters that run Databricks Runtime 9.1 LTS and below, use Koalas instead.

pandas is a Python package commonly used by data scientists for data analysis and manipulation. However, pandas does not scale out to big data. Pandas API on Spark fills this gap by providing pandas-equivalent APIs that work on Apache Spark. This open-source API is an ideal choice for data scientists who are familiar with pandas but not Apache Spark.

Manage code with notebooks and Databricks Git folders

Databricks notebooks support Python. These notebooks provide functionality similar to that of Jupyter, but with additions such as built-in visualizations using big data, Apache Spark integrations for debugging and performance monitoring, and MLflow integrations for tracking machine learning experiments. Get started by importing a notebook. Once you have access to a cluster, you can attach a notebook to the cluster and run the notebook.

Tip

To completely reset the state of your notebook, it can be useful to restart the iPython kernel. For Jupyter users, the “restart kernel” option in Jupyter corresponds to detaching and re-attaching a notebook in Databricks. To restart the kernel in a Python notebook, click the compute selector in the notebook toolbar and hover over the attached cluster or SQL warehouse in the list to display a side menu. Select Detach & re-attach. This detaches the notebook from your cluster and reattaches it, which restarts the Python process.

Databricks Git folders allow users to synchronize notebooks and other files with Git repositories. Databricks Git folders help with code versioning and collaboration, and it can simplify importing a full repository of code into Databricks, viewing past notebook versions, and integrating with IDE development. Get started by cloning a remote Git repository. You can then open or create notebooks with the repository clone, attach the notebook to a cluster, and run the notebook.

Clusters and libraries

Databricks compute provides compute management for clusters of any size: from single node clusters up to large clusters. You can customize cluster hardware and libraries according to your needs. Data scientists will generally begin work either by creating a cluster or using an existing shared cluster. Once you have access to a cluster, you can attach a notebook to the cluster or run a job on the cluster.

Databricks clusters use a Databricks Runtime, which provides many popular libraries out-of-the-box, including Apache Spark, Delta Lake, pandas, and more. You can also install additional third-party or custom Python libraries to use with notebooks and jobs.

Visualizations

Databricks Python notebooks have built-in support for many types of visualizations. You can also use legacy visualizations.

You can also visualize data using third-party libraries; some are pre-installed in the Databricks Runtime, but you can install custom libraries as well. Popular options include:

Jobs

You can automate Python workloads as scheduled or triggered jobs in Databricks. Jobs can run notebooks, Python scripts, and Python wheel files.

Tip

To schedule a Python script instead of a notebook, use the spark_python_task field under tasks in the body of a create job request.

Machine learning

Databricks supports a wide variety of machine learning (ML) workloads, including traditional ML on tabular data, deep learning for computer vision and natural language processing, recommendation systems, graph analytics, and more. For general information about machine learning on Databricks, see AI and machine learning on Databricks.

For ML algorithms, you can use pre-installed libraries in Databricks Runtime for Machine Learning, which includes popular Python tools such as scikit-learn, TensorFlow, Keras, PyTorch, Apache Spark MLlib, and XGBoost. You can also install custom libraries.

For machine learning operations (MLOps), Databricks provides a managed service for the open source library MLflow. With MLflow Tracking you can record model development and save models in reusable formats. You can use the MLflow Model Registry to manage and automate the promotion of models towards production. Jobs allows hosting models as batch and streaming jobs. For more information and examples, see the ML lifecycle management using MLflow or the MLflow Python API docs.

To get started with common machine learning workloads, see the following pages:

IDEs, developer tools, and SDKs

In addition to developing Python code within Databricks notebooks, you can develop externally using integrated development environments (IDEs) such as PyCharm, Jupyter, and Visual Studio Code. To synchronize work between external development environments and Databricks, there are several options:

  • Code: You can synchronize code using Git. See Git integration for Databricks Git folders.

  • Libraries and Jobs: You can create libraries (such as Python wheel files) externally and upload them to Databricks. Those libraries may be imported within Databricks notebooks, or they can be used to create jobs. See Libraries and Schedule and orchestrate workflows.

  • Remote machine execution: You can run code from your local IDE for interactive development and testing. The IDE can communicate with Databricks to execute Apache Spark and large computations on Databricks clusters. See Databricks Connect.

Databricks provides a set of SDKs, including a Python SDK, that support automation and integration with external tooling. You can use the Databricks SDKs to manage resources like clusters and libraries, code and other workspace objects, workloads and jobs, and more. See the Databricks SDKs.

For more information on IDEs, developer tools, and SDKs, see Developer tools.

Additional resources