CSV file

This article provides examples for reading from and writing to CSV files with Databricks using Python, Scala, R, and SQL.

Note

You can use SQL to read CSV data directly or by using a temporary view. Databricks recommends using a temporary view, because reading the CSV file directly has drawbacks.

See Examples.

Options

You can configure several options for CSV file data sources. See the following Apache Spark reference articles for supported read and write options.

Rescued data column

Note

This feature is supported in Databricks Runtime 8.3 (Unsupported) and above.

The rescued data column ensures that you never lose or miss out on data during ETL. The rescued data column contains any data that wasn't parsed, either because it was missing from the given schema, because there was a type mismatch, or because the casing of the column in the record or file didn't match the casing in the schema. The rescued data column is returned as a JSON blob that contains the rescued columns and the source file path of the record (the source file path is available in Databricks Runtime 8.3 and above). To remove the source file path from the rescued data column, set spark.conf.set("spark.databricks.sql.rescuedDataColumn.filePath.enabled", "false").

You can enable the rescued data column by setting the option rescuedDataColumn to a column name, such as _rescued_data with spark.read.option("rescuedDataColumn", "_rescued_data").format("csv").load(<path>).

The CSV parser supports three modes when parsing records: PERMISSIVE, DROPMALFORMED, and FAILFAST. When used together with rescuedDataColumn, data type mismatches do not cause records to be dropped in DROPMALFORMED mode or throw an error in FAILFAST mode. Only corrupt records (that is, incomplete or malformed CSV) are dropped or throw errors. If you use the badRecordsPath option when parsing CSV, data type mismatches are not considered bad records when the rescuedDataColumn is used. Only incomplete and malformed CSV records are stored in badRecordsPath.
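The interaction between type mismatches, the rescued data column, and corrupt records can be sketched in plain Python. This is an illustration of the semantics described above, not Spark's implementation; the helper name parse_record and the sample schema are hypothetical.

```python
import json

# Hypothetical schema: column names mapped to Python casts.
SCHEMA = {"id": int, "price": float}

def parse_record(fields, header, file_path):
    """Parse one CSV record. Type mismatches are rescued into a
    _rescued_data JSON blob; records with the wrong field count
    are treated as corrupt (returned as None)."""
    if len(fields) != len(header):
        return None  # corrupt record: incomplete or malformed CSV
    row, rescued = {}, {}
    for name, raw in zip(header, fields):
        try:
            row[name] = SCHEMA[name](raw)
        except ValueError:
            row[name] = None     # the value did not match the schema type
            rescued[name] = raw  # ...so it is rescued rather than dropped
    if rescued:
        rescued["_file_path"] = file_path  # source file path of the record
        row["_rescued_data"] = json.dumps(rescued)
    return row

header = ["id", "price"]
ok = parse_record(["1", "2.5"], header, "/data/part-0.csv")
mismatch = parse_record(["2", "abc"], header, "/data/part-0.csv")
corrupt = parse_record(["3"], header, "/data/part-0.csv")
```

In this sketch the mismatched record survives with its bad value captured in the JSON blob, while only the record with the wrong number of fields is treated as corrupt, mirroring the behavior described above.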

Examples

These examples use the diamonds dataset. Specify the path to the dataset, along with any options that you want.

Read file in any language

This notebook shows how to read a file, display sample data, and print the data schema using Scala, R, Python, and SQL.

Read CSV files notebook


Specify schema

When the schema of the CSV file is known, you can specify the desired schema to the CSV reader with the schema option.
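The effect of supplying a schema (typed columns instead of raw strings) can be sketched in plain Python. The helper read_with_schema and the sample columns are hypothetical, not Spark's API.

```python
import csv
import io

# Hypothetical schema: ordered (column name, cast) pairs.
SCHEMA = [("carat", float), ("cut", str), ("price", int)]

def read_with_schema(text):
    """Read CSV text, applying the declared cast to each column."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # assume a header row and skip it
    rows = []
    for record in reader:
        rows.append({name: cast(value)
                     for (name, cast), value in zip(SCHEMA, record)})
    return rows

sample = "carat,cut,price\n0.23,Ideal,326\n0.21,Premium,327\n"
rows = read_with_schema(sample)
```

Without a schema, every field would come back as a string; with one, each column arrives already parsed into its declared type, which is what specifying the schema to the CSV reader buys you.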

Read CSV files with schema notebook


Verify correctness of the data

When reading CSV files with a specified schema, it is possible that the data in the files does not match the schema. For example, a field containing the name of a city will not parse as an integer. The consequences depend on the mode that the parser runs in:

  • PERMISSIVE (default): nulls are inserted for fields that could not be parsed correctly

  • DROPMALFORMED: drops lines that contain fields that could not be parsed

  • FAILFAST: aborts the reading if any malformed data is found

To set the mode, use the mode option.

val diamonds_with_wrong_schema_permissive = spark.read.format("csv")
  .option("mode", "PERMISSIVE").schema(schema).load(<path>)  // schema: the StructType you define
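The behavior of the three modes on a type mismatch can be sketched in plain Python. This illustrates the semantics of the bullet list above, not Spark's parser; the helper name parse and the sample schema are hypothetical.

```python
# Hypothetical schema: column names mapped to Python casts.
SCHEMA = {"city": str, "population": int}

def parse(records, mode="PERMISSIVE"):
    """Parse records under one of the three modes described above."""
    rows = []
    for record in records:
        row, malformed = {}, False
        for (name, cast), raw in zip(SCHEMA.items(), record):
            try:
                row[name] = cast(raw)
            except ValueError:
                row[name] = None  # field could not be parsed
                malformed = True
        if malformed and mode == "FAILFAST":
            raise ValueError(f"Malformed record: {record}")
        if malformed and mode == "DROPMALFORMED":
            continue  # drop the whole line
        rows.append(row)  # PERMISSIVE keeps the row, with nulls
    return rows

records = [["Oslo", "700000"], ["Bergen", "unknown"]]
permissive = parse(records, "PERMISSIVE")
dropped = parse(records, "DROPMALFORMED")
```

PERMISSIVE keeps both rows and inserts a null for the unparseable field, DROPMALFORMED silently drops the second row, and FAILFAST raises on it, aborting the read.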

In PERMISSIVE mode it is possible to inspect the rows that could not be parsed correctly. To do that, add a _corrupt_record column to the schema.
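The effect of a _corrupt_record column can be sketched in plain Python: rows that fail to parse keep their raw text in that column while their other fields are null. The helper read_permissive is hypothetical, not Spark's implementation.

```python
import csv
import io

# Hypothetical schema: column names mapped to Python casts.
SCHEMA = {"id": int, "price": float}

def read_permissive(text):
    """PERMISSIVE-style parse: rows that fail keep their raw text
    in _corrupt_record; rows that parse cleanly get None there."""
    rows = []
    for line in text.splitlines():
        fields = next(csv.reader(io.StringIO(line)))
        try:
            row = {name: cast(raw)
                   for (name, cast), raw in zip(SCHEMA.items(), fields)}
            row["_corrupt_record"] = None  # parsed cleanly
        except ValueError:
            row = {name: None for name in SCHEMA}
            row["_corrupt_record"] = line  # keep raw text for inspection
        rows.append(row)
    return rows

rows = read_permissive("1,2.5\n2,abc")
good = [r for r in rows if r["_corrupt_record"] is None]
bad = [r for r in rows if r["_corrupt_record"] is not None]
```

Filtering on whether _corrupt_record is null is then enough to separate clean rows from the ones that need attention, which is the inspection workflow the notebook demonstrates.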

Find malformed rows notebook


Pitfalls of reading a subset of columns

The behavior of the CSV parser depends on the set of columns that are read. If the specified schema is incorrect, the results might differ considerably depending on the subset of columns that is accessed. The following notebook presents the most common pitfalls.
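The notebook has the details; as one hedged illustration of how such differences can arise, suppose the parser validates only the columns it actually reads (one plausible mechanism, stated here as an assumption). Then a record can look valid when you select one subset of columns and malformed when you select another. The helper select and the sample schema are hypothetical.

```python
# Hypothetical schema: column names mapped to Python casts.
SCHEMA = {"city": str, "population": int}

def select(records, columns, mode="DROPMALFORMED"):
    """Parse only the selected columns; a record counts as malformed
    only if one of the *selected* columns fails to parse."""
    rows = []
    for record in records:
        values = dict(zip(SCHEMA, record))
        try:
            rows.append({c: SCHEMA[c](values[c]) for c in columns})
        except ValueError:
            if mode == "DROPMALFORMED":
                continue  # drop the record
            raise
    return rows

records = [["Oslo", "700000"], ["Bergen", "unknown"]]
all_cols = select(records, ["city", "population"])  # mismatch detected, row dropped
city_only = select(records, ["city"])               # mismatch goes unnoticed
```

Reading both columns drops the second record, while reading only city keeps it, so the same file yields a different row count depending on the columns accessed. This is the kind of surprise an incorrect schema can produce.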

Caveats of reading a subset of columns of a CSV file notebook
