Databricks datasets

Databricks includes a variety of datasets mounted to Databricks File System (DBFS). These datasets are used in examples throughout the documentation.

Browse Databricks datasets

To browse these files from a notebook in Data Science & Engineering or Databricks Machine Learning using Python, Scala, or R, you can use Databricks Utilities. The code in this example lists all of the available Databricks datasets:

%fs ls "/databricks-datasets"
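The same listing is available programmatically: in Python you can call dbutils.fs.ls("/databricks-datasets"), or, because DBFS is mirrored under the local /dbfs mount, use the standard library. A minimal sketch of the latter approach (the helper name and default argument are illustrative, not part of any Databricks API):

```python
import os

def list_datasets(root="/dbfs/databricks-datasets"):
    # Each entry under the root is one Databricks dataset directory.
    # The default path assumes the DBFS local mount available on a cluster.
    return sorted(os.listdir(root))
```

On a cluster, this returns the same names that the `%fs ls` command displays.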

Get information about Databricks datasets

To get more information about a dataset, you can use a local file API to print out the dataset README (if one is available) from a Python, R, or Scala notebook in Data Science & Engineering or Databricks Machine Learning, as shown in this code example:

Python

f = open('/dbfs/databricks-datasets/README.md', 'r')
print(f.read())

Scala

scala.io.Source.fromFile("/dbfs/databricks-datasets/README.md").foreach {
  print
}

R

library(readr)

f = read_lines("/dbfs/databricks-datasets/README.md", skip = 0, n_max = -1L)
print(f)
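Since not every dataset ships a README, a guarded read avoids an error on a missing file. A small Python sketch (the function name read_readme is illustrative, not a Databricks API):

```python
import os

def read_readme(dataset_dir):
    # On Databricks, dataset_dir is a local-mount path such as
    # '/dbfs/databricks-datasets/<dataset-name>'.
    path = os.path.join(dataset_dir, "README.md")
    if not os.path.exists(path):
        return None  # this dataset has no README
    with open(path, "r") as f:
        return f.read()
```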

Create a table based on a Databricks dataset

This code example demonstrates how to use Python, Scala, or R in a notebook to create a table based on a Databricks dataset:

Python or Scala

spark.sql("CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/')")

R

library(SparkR)

sql("CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/')")