Can you use pandas on Databricks?

Databricks Runtime includes pandas as one of the standard Python packages, allowing you to create and leverage pandas DataFrames in Databricks notebooks and jobs.

In Databricks Runtime 10.4 LTS and above, Pandas API on Spark provides familiar pandas commands on top of PySpark DataFrames. You can also convert DataFrames between pandas and PySpark.

Apache Spark includes Arrow-optimized execution of Python logic in the form of pandas function APIs, which allow users to apply pandas transformations directly to PySpark DataFrames. Apache Spark also supports pandas UDFs, which use similar Arrow-optimizations for arbitrary user functions defined in Python.

Where does pandas store data on Databricks?

You can use pandas to store data in many different locations on Databricks. Your ability to store and load data from some locations depends on configurations set by workspace administrators.

Note

Databricks recommends storing production data on cloud object storage. See Connect to Google Cloud Storage.

For quick exploration and data without sensitive information, you can safely save data using either relative paths or the DBFS, as in the following examples:

import pandas as pd

df = pd.DataFrame([["a", 1], ["b", 2], ["c", 3]])

df.to_csv("./relative_path_test.csv")
df.to_csv("/dbfs/dbfs_test.csv")

You can explore files written to the DBFS with the %fs magic command, as in the following example. Note that the /dbfs directory is the root path for these commands.

%fs ls

When you save to a relative path, the location of your file depends on where you execute your code. If you’re using a Databricks notebook, your data file saves to the volume storage attached to the driver of your cluster. Data stored in this location is permanently deleted when the cluster terminates. If you’re using Databricks Git folders with arbitrary file support enabled, your data saves to the root of your current project. In either case, you can explore the files written using the %sh magic command, which allows simple bash operations relative to your current root directory, as in the following example:

%sh ls

For more information on how Databricks stores various files, see Work with files on Databricks.

How do you load data with pandas on Databricks?

Databricks provides a number of options to facilitate uploading data to the workspace for exploration. The preferred method to load data with pandas varies depending on how you load your data to the workspace.

If you have small data files stored alongside notebooks on your local machine, you can upload your data and code together with Git folders. You can then use relative paths to load data files.

Databricks provides extensive UI-based options for data loading. Most of these options store your data as Delta tables. You can read a Delta table to a Spark DataFrame, and then convert that to a pandas DataFrame.

If you have saved data files using DBFS or relative paths, you can use DBFS or relative paths to reload those data files. The following code provides an example:

import pandas as pd

df = pd.read_csv("./relative_path_test.csv")
df = pd.read_csv("/dbfs/dbfs_test.csv")

You can load data directly from Google Cloud Storage using pandas and a fully qualified URL. You need to provide cloud credentials to access cloud data.

df = pd.read_csv(
  f"gs://{bucket_name}/{file_path}",
  storage_options={
    "token": credentials
  }
)