Sample datasets

Third parties provide a variety of datasets that you can upload to your Databricks workspace and use. Databricks also provides a variety of datasets that are already mounted to DBFS in your Databricks workspace.

Third-party sample datasets

Databricks has built-in tools to quickly upload third-party sample datasets as comma-separated values (CSV) files into Databricks workspaces. The following are some popular third-party sample datasets available in CSV format:

Sample dataset

To download the sample dataset as a CSV file…

The Squirrel Census

On the Data webpage, click Park Data, Squirrel Data, or Stories.

OWID Dataset Collection

In the GitHub repository, click the datasets folder. Click the subfolder that contains the target dataset, and then click the dataset’s CSV file.

CSV datasets

On the search results webpage, click the target search result, and next to the CSV icon, click Download.

Diamonds (Requires a Kaggle account)

On the dataset’s webpage, on the Data tab, next to diamonds.csv, click the Download icon.

NYC Taxi Trip Duration (Requires a Kaggle account)

On the dataset’s webpage, on the Data tab, click the Download icon. To find the dataset’s CSV files, extract the contents of the downloaded ZIP file.

UFO Sightings (Requires an account)

On the dataset’s webpage, next to nuforc_reports.csv, click the Download icon.

To use third-party sample datasets in your Databricks workspace, do the following:

  1. Follow the third party’s instructions to download the dataset as a CSV file to your local machine.

  2. Upload the CSV file from your local machine into your Databricks workspace.

  3. To work with the imported data, use Databricks SQL to query it, or use a notebook to load it as a DataFrame.
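
The notebook option in the last step can be sketched in Python as follows. This is a minimal sketch, not the only way to load the file: it assumes the notebook's predefined SparkSession is passed in as spark, that you know the path where the upload stored the file, and that the CSV has a header row. The helper name load_csv_as_dataframe and the header and inferSchema settings are illustrative assumptions.

```python
def load_csv_as_dataframe(spark, path, header=True, infer_schema=True):
    """Read a CSV file into a Spark DataFrame.

    spark: the SparkSession (Databricks notebooks predefine one as `spark`).
    path:  location of the uploaded CSV file (hypothetical; depends on
           where the upload step stored it).
    """
    return (
        spark.read.format("csv")
        # Treat the first row as column names (assumption about the file).
        .option("header", str(header).lower())
        # Let Spark guess column types instead of reading all as strings.
        .option("inferSchema", str(infer_schema).lower())
        .load(path)
    )

# Usage in a notebook (path is a placeholder, not a real location):
# df = load_csv_as_dataframe(spark, "<path-to-uploaded-file>.csv")
# display(df)
```

Passing the session as an argument keeps the helper independent of notebook globals, so the same function can be reused outside a notebook.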

Databricks datasets (databricks-datasets)

Databricks includes a variety of datasets mounted to DBFS.


The availability and location of Databricks datasets are subject to change without notice.

Browse Databricks datasets

To browse these files from a notebook in Data Science & Engineering or Databricks Machine Learning using Python, Scala, or R, you can use Databricks Utilities. The code in this example lists all of the available Databricks datasets.

%fs ls "/databricks-datasets"

Get information about Databricks datasets

To get more information about a dataset, you can use a local file API to print the dataset README (if one is available) from a Python, Scala, or R notebook in Data Science & Engineering or Databricks Machine Learning, as shown in the following code examples.

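The paths in these examples end at the /dbfs/databricks-datasets/ directory. Append the path to the README file of the dataset that you want to read.

Python

with open('/dbfs/databricks-datasets/', 'r') as f:
    print(f.read())

Scala

scala.io.Source.fromFile("/dbfs/databricks-datasets/").foreach { print }

R

library(readr)
f = read_lines("/dbfs/databricks-datasets/", skip = 0, n_max = -1L)
print(f)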
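
Because a README is only printed "if one is available", it can help to check for the file before opening it. The following is a minimal Python sketch of that check using the local file API. The helper name read_dataset_readme and the candidate file names (README.md, README.txt, README) are assumptions for illustration; the actual README names vary by dataset.

```python
from pathlib import Path


def read_dataset_readme(dataset_dir, names=("README.md", "README.txt", "README")):
    """Return the text of the first README found in dataset_dir, or None.

    The candidate file names are assumptions; adjust them to match
    the dataset you are inspecting.
    """
    base = Path(dataset_dir)
    for name in names:
        candidate = base / name
        if candidate.is_file():
            # Replace undecodable bytes rather than failing on odd encodings.
            return candidate.read_text(encoding="utf-8", errors="replace")
    return None


# Usage in a notebook (the directory is a placeholder):
# text = read_dataset_readme("/dbfs/databricks-datasets/<dataset-folder>")
# print(text if text is not None else "No README found")
```

The function returns None instead of raising an error, so a notebook cell can loop over several dataset folders without stopping at the first one that lacks a README.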

Create a table based on a Databricks dataset

This code example demonstrates how to use Python, Scala, or R in a notebook to create a table based on a Databricks dataset:

Python

spark.sql("CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/')")

Scala

spark.sql("CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/')")

R

library(SparkR)
sql("CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/')")
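
To create tables for several datasets, the statement above can be generated from a table name and a path. The following Python sketch shows one way to do that. The helper name create_table_statement is hypothetical, and the commented-out calls assume a Databricks notebook where spark and display are predefined.

```python
def create_table_statement(table_name, dataset_path):
    """Build a CREATE TABLE statement that points at a dataset path."""
    # Reject quotes so the path cannot break out of the SQL string literal.
    if "'" in dataset_path:
        raise ValueError("dataset_path must not contain single quotes")
    return f"CREATE TABLE {table_name} OPTIONS (PATH '{dataset_path}')"


statement = create_table_statement(
    "default.people10m",
    "dbfs:/databricks-datasets/learning-spark-v2/people/",
)
print(statement)

# In a Databricks notebook, run the statement and preview the table:
# spark.sql(statement)
# display(spark.table("default.people10m").limit(5))
```

Building the statement as a string first makes it easy to print and review before it is executed.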