Download data from the internet

You can use Databricks notebooks to download data from public URLs to volume storage attached to the driver of your cluster. If your data is already in cloud object storage, reading it directly with Apache Spark provides better results. See Connect to data sources.
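For example, if the data already resides in cloud object storage, you can load it with Spark instead of downloading it first. The following sketch assumes a hypothetical s3://my-bucket/data.csv path that your workspace has been configured to access:

# Read a CSV file directly from cloud object storage.
# s3://my-bucket/data.csv is a hypothetical path; substitute a location
# your workspace is configured to access.
df = spark.read.format("csv").option("header", True).load("s3://my-bucket/data.csv")
display(df)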

Databricks clusters provide general compute, allowing you to run arbitrary code in addition to Apache Spark commands. Because arbitrary commands execute against the root directory for the cluster rather than the DBFS root, you must move downloaded data to a new location before reading it with Apache Spark.
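The following commands illustrate this distinction. The first lists the root of the driver's local filesystem, and the second lists the DBFS root; run them in separate notebook cells:

%sh ls /

display(dbutils.fs.ls("dbfs:/"))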

Note

Some workspace configurations might prevent access to the public internet. Consult your workspace administrator if you need expanded network access.

Download a file with Bash, Python, or Scala

Databricks does not provide any native tools for downloading data from the internet, but you can use open source tools in supported languages. The following examples use packages for Bash, Python, and Scala to download the same file.

Bash

%sh curl https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv --output /tmp/curl-subway.csv

Python

import urllib.request
urllib.request.urlretrieve("https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv", "/tmp/python-subway.csv")

Scala

import java.net.URL
import java.io.File
import org.apache.commons.io.FileUtils

FileUtils.copyURLToFile(new URL("https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv"), new File("/tmp/scala-subway.csv"))

Because these files are downloaded to the volume storage attached to the driver, use %sh to see them, as in the following example:

%sh ls /tmp/

You can use Bash commands to preview the contents of files downloaded this way, as in the following example:

%sh head /tmp/curl-subway.csv

Move data with dbutils

To access data with Apache Spark, you must move it from its current location: ephemeral volume storage that is visible only to the driver. Databricks loads data from file sources in parallel, so files must be visible to all nodes in the compute environment. While Databricks supports a wide range of external data sources, file-based data access generally assumes access to cloud object storage. See Connect to data sources.

The Databricks Utilities (dbutils) allow you to move files from volume storage attached to the driver to other locations accessible through DBFS, including external object storage locations you have configured access to. The following example moves data to a directory in the DBFS root, a cloud object storage location configured during initial workspace deployment.

dbutils.fs.mv("file:/tmp/curl-subway.csv", "dbfs:/tmp/subway.csv")
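If you have configured access to external cloud object storage, you can move the file there instead. The following sketch assumes a hypothetical s3://my-bucket path; substitute a location you have configured access to:

# s3://my-bucket is a hypothetical bucket; replace it with external object
# storage your workspace can write to.
dbutils.fs.mv("file:/tmp/curl-subway.csv", "s3://my-bucket/tmp/subway.csv")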

Read downloaded data

After you move the data to cloud object storage, you can read the data as normal. The following code reads in the CSV data moved to the DBFS root.

df = spark.read.format("csv").option("header", True).load("/tmp/subway.csv")
display(df)