Databricks datasets

Databricks includes a variety of sample datasets mounted to the Databricks File System (DBFS) under /databricks-datasets. These datasets are used in examples throughout the documentation.

Browse Databricks datasets

To browse these files from a Python, Scala, or R notebook in Data Science & Engineering or Databricks Machine Learning, use Databricks Utilities. The following examples list all of the available Databricks datasets.

Python:
display(dbutils.fs.ls('/databricks-datasets'))

Scala:
display(dbutils.fs.ls("/databricks-datasets"))

R:
%fs ls "/databricks-datasets"
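
The listing can also be filtered in Python before it is displayed. The following is a minimal sketch, assuming each entry returned by dbutils.fs.ls exposes a name attribute and that directory names end with a trailing slash. The helper name dataset_dirs is hypothetical and not part of Databricks Utilities.

```python
def dataset_dirs(entries):
    """Return the sorted names of directory entries, without the trailing slash."""
    # Assumption: directory entries have a name that ends with "/".
    return sorted(e.name.rstrip("/") for e in entries if e.name.endswith("/"))


try:
    # dbutils is only defined inside a Databricks notebook.
    listing = dbutils.fs.ls("/databricks-datasets")
except NameError:
    listing = []  # Outside Databricks there is nothing to list.

for name in dataset_dirs(listing):
    print(name)
```

Keeping the filter in a plain function separates it from the notebook-only dbutils call, so it can be checked with any objects that have a name attribute.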

Get information about Databricks datasets

To get more information about a dataset, use a local file API to print the dataset README, if one is available. The following examples do this from a Python, Scala, or R notebook in Data Science & Engineering or Databricks Machine Learning.

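Python:
# Read the README through the local file API; the with block closes the file.
with open('/dbfs/databricks-datasets/README.md', 'r') as f:
    print(f.read())

Scala:
// Read the README through the local file API and close the source afterwards.
val source = scala.io.Source.fromFile("/dbfs/databricks-datasets/README.md")
try print(source.mkString) finally source.close()

R:
library(readr)

# Read every line of the README into a character vector.
f <- read_lines("/dbfs/databricks-datasets/README.md", skip = 0, n_max = -1L)
print(f)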
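
Because not every dataset ships a README, it can help to find out which ones do before opening any. The following Python sketch uses only the local file API and looks in the root directory and its immediate subdirectories. The helper name find_readmes and the rule that a README file name starts with "readme" (case-insensitive) are assumptions made for illustration.

```python
import os


def is_readme(path):
    """Return True if path is a file whose name starts with 'readme'."""
    name = os.path.basename(path).lower()
    return os.path.isfile(path) and name.startswith("readme")


def find_readmes(root="/dbfs/databricks-datasets"):
    """Return README paths in root and in its immediate subdirectories."""
    found = []
    if not os.path.isdir(root):
        return found  # The datasets are not available at this local path.
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if is_readme(path):
            found.append(path)
        elif os.path.isdir(path):
            # Look one level down only; deeper levels are not searched.
            for child in sorted(os.listdir(path)):
                child_path = os.path.join(path, child)
                if is_readme(child_path):
                    found.append(child_path)
    return found


for readme in find_readmes():
    print(readme)
```

Any path that is printed can then be opened and read in the same way as the root README.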

Create a table based on a Databricks dataset

The following examples show how to create a table based on a Databricks dataset from a Python, Scala, or R notebook:

Python:
spark.sql("CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta')")

Scala:
spark.sql("CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta')")

R:
library(SparkR)
sparkR.session()

sql("CREATE TABLE default.people10m OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta')")
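
The same statement can be built from a table name and a dataset path, which makes it easier to reuse for other datasets. The following Python sketch is one possible approach, not the documented method. The helper name create_table_sql is hypothetical, and the IF NOT EXISTS clause is an addition so that rerunning the cell does not fail when the table is already defined.

```python
def create_table_sql(table_name, dataset_path):
    """Build a CREATE TABLE statement that points at an existing dataset path."""
    # IF NOT EXISTS is an addition to the original statement (see the text above).
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        f"OPTIONS (PATH '{dataset_path}')"
    )


PEOPLE_PATH = "dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta"

statement = create_table_sql("default.people10m", PEOPLE_PATH)
print(statement)

try:
    # spark is only defined inside a Databricks notebook or a Spark session.
    spark.sql(statement)
    spark.sql("SELECT * FROM default.people10m LIMIT 5").show()
except NameError:
    pass  # Outside Spark, only the statement text is produced.
```

The helper interpolates its arguments directly into the SQL text, so it should only be called with trusted table names and paths.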