Expand and read Zip compressed files
You can use the unzip Bash command to expand files or directories of files that have been Zip compressed. If you download or encounter a file or directory ending with .zip, expand the data before trying to continue.
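A file name ending in .zip is only a hint. As a minimal sketch, the Python standard library zipfile module can confirm that a file really is a Zip archive and list its members before you expand it. The helper name list_zip_members is hypothetical and used only for illustration, and the path is assumed to be readable from the driver's local filesystem.

```python
import zipfile


def list_zip_members(path):
    """Return the member names of a Zip archive, or None if path is not a Zip file."""
    # is_zipfile inspects the file contents rather than trusting the extension.
    if not zipfile.is_zipfile(path):
        return None
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

For example, calling list_zip_members("/tmp/LoanStats3a.csv.zip") after the download step shows which files the archive would produce when expanded.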
Note
Apache Spark provides native codecs for interacting with compressed Parquet files. Most Parquet files written by Databricks end with .snappy.parquet, indicating they use snappy compression.
How to unzip data
The Databricks %sh magic command enables execution of arbitrary Bash code, including the unzip command.
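Where a Python cell is more convenient than %sh, the same expansion step can be approximated with the standard library zipfile module. This is a sketch of an alternative technique, not the unzip command itself. The helper name expand_zip is hypothetical, and both the archive and the destination directory are assumed to be on the driver's local disk.

```python
import zipfile
from pathlib import Path


def expand_zip(archive_path, dest_dir):
    """Extract every member of archive_path into dest_dir and return the extracted file paths."""
    dest = Path(dest_dir)
    # Create the destination directory if it does not exist yet.
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest)
        # Skip directory entries so that only files are reported.
        return [str(dest / name) for name in archive.namelist() if not name.endswith("/")]
```

A call such as expand_zip("/tmp/LoanStats3a.csv.zip", "/tmp") would return the local paths of the extracted files, which can then be moved with dbutils.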
The following example uses a zipped CSV file downloaded from the internet. See Download data from the internet.
Note
You can use the Databricks Utilities to move files to the ephemeral storage attached to the driver before expanding them. You cannot expand zip files while they reside in Unity Catalog volumes. See Databricks Utilities (dbutils) reference.
The following code uses curl to download and then unzip to expand the data:
%sh curl https://resources.lendingclub.com/LoanStats3a.csv.zip --output /tmp/LoanStats3a.csv.zip
unzip /tmp/LoanStats3a.csv.zip -d /tmp
The -d option extracts into /tmp, so the expanded file has a known location on the driver regardless of the shell's working directory. Use dbutils to move the expanded file to a Unity Catalog volume, as follows:
dbutils.fs.mv("file:/tmp/LoanStats3a.csv", "/Volumes/my_catalog/my_schema/my_volume/LoanStats3a.csv")
In this example, the downloaded data has a comment in the first row and a header in the second. Now that the data has been expanded and moved, use standard options for reading CSV files, as in the following example:
df = spark.read.format("csv").option("skipRows", 1).option("header", True).load("/Volumes/my_catalog/my_schema/my_volume/LoanStats3a.csv")
display(df)
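When it is not obvious how many leading rows to skip, it can help to look at the first few lines of the expanded file before choosing read options. The following is a minimal sketch using only the Python standard library. The helper name peek_lines is hypothetical, and the path is assumed to be readable with Python's built-in open, for example a file on the driver's local disk.

```python
from itertools import islice


def peek_lines(path, n=3, encoding="utf-8"):
    """Return the first n lines of a text file without reading the whole file."""
    # errors="replace" keeps the preview from failing on unexpected bytes.
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]
```

If the first returned line is a comment and the second holds the column names, that matches the layout described above, where one row is skipped and the next is treated as the header.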