Databricks for Python developers

This section provides a guide to developing notebooks and jobs in Databricks using the Python language. The first subsection provides links to tutorials for common workflows and tasks. The second subsection provides links to APIs, libraries, and key tools.

A basic workflow for getting started is:

Tutorials

The tutorials below provide example code and notebooks that cover common workflows. See Import a notebook for instructions on importing notebook examples into your workspace.

Interactive data science and machine learning

Data engineering

Reference

The subsections below list key features and tips to help you begin developing in Databricks with Python.

Python APIs

Python code that runs outside of Databricks can generally run within Databricks, and vice versa. If you have existing code, just import it into Databricks to get started. See Manage code with notebooks and Databricks Repos below for details.

Databricks can run both single-machine and distributed Python workloads. For single-machine computing, you can use Python APIs and libraries as usual; for example, pandas and scikit-learn will “just work.” For distributed Python workloads, Databricks offers two popular APIs out of the box: the Pandas API on Spark and PySpark.

Pandas API on Spark

Note

The Koalas open-source project now recommends switching to the Pandas API on Spark. The Pandas API on Spark is available on clusters that run Databricks Runtime 10.0 (Unsupported) and above. For clusters that run Databricks Runtime 9.1 LTS and below, use Koalas instead.

pandas is a Python package commonly used by data scientists for data analysis and manipulation. However, pandas does not scale out to big data. The Pandas API on Spark fills this gap by providing pandas-equivalent APIs that work on Apache Spark. This open-source API is an ideal choice for data scientists who are familiar with pandas but not Apache Spark.
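As a minimal sketch of what "pandas-equivalent" looks like in practice, the following assumes a cluster running Databricks Runtime 10.0 or above, where the `pyspark.pandas` module is importable. The function name `total_by_key` and the column names are hypothetical and chosen only for illustration.

```python
def total_by_key(records):
    """Sum the 'amount' column per 'region' using the Pandas API on Spark.

    `records` maps column names to lists of values (hypothetical sample data).
    """
    # Imported lazily so this module can be loaded where PySpark is not installed.
    import pyspark.pandas as ps

    # Build a distributed DataFrame with the familiar pandas-style constructor.
    psdf = ps.DataFrame(records)

    # groupby/sum mirror the pandas calls but execute on Apache Spark.
    totals = psdf.groupby("region")["amount"].sum()

    # Bring the small aggregated result to the driver as a plain dict.
    return totals.to_pandas().to_dict()
```

On such a cluster, calling `total_by_key({"region": ["east", "west", "east"], "amount": [10, 5, 20]})` should return the per-region sums. Only the final `to_pandas()` step collects data to a single machine, so it is best reserved for results that are already small.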

PySpark API

PySpark is the official Python API for Apache Spark. This API provides more flexibility than the Pandas API on Spark. These links provide an introduction to and reference for PySpark.
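For comparison, here is a rough sketch of a grouped sum written directly against the PySpark DataFrame API. It assumes PySpark is available (as it is on Databricks clusters); the function name `total_by_key_pyspark` and the column names are hypothetical.

```python
def total_by_key_pyspark(rows):
    """Sum 'amount' per 'region' with the PySpark DataFrame API.

    `rows` is a list of (region, amount) tuples (hypothetical sample data).
    """
    # Imported lazily so this module can be loaded where PySpark is not installed.
    from pyspark.sql import SparkSession, functions as F

    # In a Databricks notebook a session already exists; getOrCreate() reuses it.
    spark = SparkSession.builder.getOrCreate()

    # Name the columns explicitly when building the DataFrame.
    df = spark.createDataFrame(rows, schema=["region", "amount"])

    # Aggregations are expressed as explicit column expressions.
    totals = df.groupBy("region").agg(F.sum("amount").alias("total"))

    # collect() returns the small aggregated result to the driver.
    return {row["region"]: row["total"] for row in totals.collect()}
```

The PySpark version is more verbose than a pandas-style call, but every step (schema, aggregation expression, collection) is explicit, which is where the additional flexibility comes from.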

Manage code with notebooks and Databricks Repos

Databricks notebooks support Python. These notebooks provide functionality similar to that of Jupyter, but with additions such as built-in visualizations using big data, Apache Spark integrations for debugging and performance monitoring, and MLflow integrations for tracking machine learning experiments. Get started by importing a notebook. Once you have access to a cluster, you can attach a notebook to the cluster and run the notebook.

Tip

To completely reset the state of your notebook, it can be useful to restart the IPython kernel. For Jupyter users, the “restart kernel” option in Jupyter corresponds to detaching and re-attaching a notebook in Databricks. To restart the kernel in a Python notebook, click the cluster dropdown in the upper-left and click Detach & Re-attach. This detaches the notebook from your cluster and reattaches it, which restarts the Python process.

Databricks Repos allows users to synchronize notebooks and other files with Git repositories. Databricks Repos helps with code versioning and collaboration, and it can simplify importing a full repository of code into Databricks, viewing past notebook versions, and integrating with IDE development. Get started by cloning a remote Git repository. You can then open or create notebooks in the repository clone, attach a notebook to a cluster, and run it.

Clusters and libraries

Databricks Clusters provide compute management for both single nodes and large clusters. You can customize cluster hardware and libraries according to your needs. Data scientists will generally begin work either by creating a cluster or using an existing shared cluster. Once you have access to a cluster, you can attach a notebook to the cluster or run a job on the cluster.

Databricks clusters use a Databricks Runtime, which provides many popular libraries out of the box, including Apache Spark, Delta Lake, pandas, and more. You can also install additional third-party or custom Python libraries to use with notebooks and jobs.

Visualizations

Databricks Python notebooks have built-in support for many types of visualizations. You can also use legacy visualizations.

You can also visualize data using third-party libraries; some are pre-installed in the Databricks Runtime, but you can install custom libraries as well. Popular options include:

Jobs

You can automate Python workloads as scheduled or triggered jobs in Databricks; see Create, run, and manage Databricks Jobs. Jobs can run notebooks, Python scripts, and Python wheels.

  • For details on creating a job via the UI, see Create a job.

  • The Jobs API 2.1 allows you to create, edit, and delete jobs.

Tip

To schedule a Python script instead of a notebook, use the spark_python_task field under tasks in the body of a create job request.
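The tip above can be sketched as a small helper that assembles the request body. This is an illustration under assumptions, not a complete reference: it assumes the body is sent to the Jobs API 2.1 create endpoint (`POST /api/2.1/jobs/create`) and that the task uses the `task_key`, `existing_cluster_id`, `python_file`, and `parameters` fields; check these against the Jobs API reference. The job name, script path, and cluster ID are hypothetical placeholders.

```python
import json


def build_python_script_job(name, python_file, cluster_id, parameters=None):
    """Return a create-job request body that runs a Python script as a task."""
    task = {
        "task_key": "run_script",  # hypothetical task key
        "existing_cluster_id": cluster_id,
        # spark_python_task sits under tasks, as described in the tip above.
        "spark_python_task": {
            "python_file": python_file,
            "parameters": list(parameters or []),
        },
    }
    return {"name": name, "tasks": [task]}


body = build_python_script_job(
    name="nightly-etl",                  # hypothetical job name
    python_file="dbfs:/scripts/etl.py",  # hypothetical script location
    cluster_id="1234-567890-abcde123",   # placeholder cluster ID
    parameters=["--date", "2023-01-01"],
)
print(json.dumps(body, indent=2))
```

Building the body as a plain dictionary keeps it easy to inspect and version before it is sent with whichever HTTP client you prefer.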

Machine learning

Databricks supports a wide variety of machine learning (ML) workloads, including traditional ML on tabular data, deep learning for computer vision and natural language processing, recommendation systems, graph analytics, and more. For general information about machine learning on Databricks, see the Databricks Machine Learning guide.

For ML algorithms, you can use pre-installed libraries in the Databricks Runtime for Machine Learning, which includes popular Python tools such as scikit-learn, TensorFlow, Keras, PyTorch, Apache Spark MLlib, and XGBoost. You can also install custom libraries.

For machine learning operations (MLOps), Databricks provides a managed service for the open source library MLflow. MLflow Tracking lets you record model development and save models in reusable formats; the MLflow Model Registry lets you manage and automate the promotion of models towards production; and Jobs and model serving, with Serverless Real-Time Inference or Classic MLflow Model Serving, allow hosting models as batch and streaming jobs and as REST endpoints. For more information and examples, see the MLflow guide or the MLflow Python API docs.
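As a minimal sketch of MLflow Tracking, the following records the parameters and metrics of a single run. It assumes the `mlflow` package is installed (as it is in the Databricks Runtime for Machine Learning); the function name `log_training_run` and the example values are hypothetical.

```python
def log_training_run(params, metrics):
    """Record one run's parameters and metrics with MLflow Tracking.

    Returns the ID of the run that was created.
    """
    # Imported lazily so this module can be loaded where MLflow is not installed.
    import mlflow

    # start_run() opens a run and closes it when the with-block exits.
    with mlflow.start_run() as run:
        mlflow.log_params(params)    # e.g. hyperparameters
        mlflow.log_metrics(metrics)  # e.g. evaluation scores
        return run.info.run_id
```

For example, `log_training_run({"max_depth": 5}, {"rmse": 0.42})` would create one run whose parameters and metrics can then be compared against other runs in the tracking UI.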

To get started with common machine learning workloads, see the following pages:

IDEs, developer tools, and APIs

In addition to developing Python code within Databricks notebooks, you can develop externally using integrated development environments (IDEs) such as PyCharm, Jupyter, and Visual Studio Code. To synchronize work between external development environments and Databricks, there are several options:

  • Code: You can synchronize code using Git. See Git integration with Databricks Repos.

  • Libraries and Jobs: You can create libraries (such as wheels) externally and upload them to Databricks. Those libraries may be imported within Databricks notebooks, or they can be used to create jobs. See Libraries and Create, run, and manage Databricks Jobs.

  • Remote machine execution: You can run code from your local IDE for interactive development and testing. The IDE can communicate with Databricks to execute large computations on Databricks clusters. To learn to use Databricks Connect to create this connection, see Use IDEs with Databricks.

Databricks provides a full set of REST APIs that support automation and integration with external tooling. You can use APIs to manage resources like clusters and libraries, code and other workspace objects, workloads and jobs, and more. See REST API (latest).
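As an illustration of calling a REST API from Python, the sketch below builds (without sending) a request that lists clusters. It assumes the Clusters API endpoint `GET /api/2.0/clusters/list` and bearer-token authentication with a personal access token; the workspace URL and token shown are hypothetical placeholders.

```python
import urllib.request


def build_list_clusters_request(workspace_url, token):
    """Build, but do not send, a GET request for the clusters list endpoint."""
    url = workspace_url.rstrip("/") + "/api/2.0/clusters/list"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


request = build_list_clusters_request(
    "https://example.cloud.databricks.com/",  # hypothetical workspace URL
    "dapi-EXAMPLE-TOKEN",                     # placeholder token, not a real secret
)
print(request.get_method(), request.full_url)
```

To actually send the request, pass it to `urllib.request.urlopen` and parse the JSON response. In real code, read the token from a secret store or environment variable rather than hard-coding it.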

For more information on IDEs, developer tools, and APIs, see Developer tools and guidance.

Additional resources