Hadoop does not support zip files as a compression codec. While text files in GZip, BZip2, and other supported compression formats are automatically decompressed by Apache Spark as long as they have the right file extension, you must perform additional steps to read zip files.
The following notebooks show how to read zip files. After you download a zip file to a temp directory, you can run the unzip command through the Databricks %sh magic command to unzip the file. For the sample file used in the notebooks, the tail step removes a comment line from the unzipped file.
When you use %sh to operate on files, the results are stored in the directory /databricks/driver. Before you load the file using the Spark API, you move the file to DBFS using Databricks Utilities.
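The move-and-load step might look like the following sketch. It only runs inside a Databricks notebook, where the runtime provides the dbutils and spark objects, and the paths are hypothetical stand-ins for your own file names:

```python
# Sketch assuming a Databricks notebook; dbutils and spark are supplied
# by the runtime, and the file names below are placeholders.

# %sh writes its results under /databricks/driver, so the unzipped file
# lives on the driver's local filesystem, not in DBFS.
local_path = "file:/databricks/driver/sample_clean.csv"
dbfs_path = "dbfs:/tmp/sample_clean.csv"

# Move the file into DBFS using Databricks Utilities.
dbutils.fs.mv(local_path, dbfs_path)

# Once the file is in DBFS, the Spark API can read it directly.
df = spark.read.option("header", "true").csv(dbfs_path)
```

The file: prefix on the source path tells dbutils.fs.mv to read from the driver's local filesystem rather than DBFS.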