JSON file

You can read JSON files in single-line or multi-line mode. In single-line mode, a file can be split into many parts and read in parallel. In multi-line mode, a file is loaded as a whole entity and cannot be split.

For further information, see JSON Files.


See the following Apache Spark reference articles for supported read and write options.

Rescued data column


This feature is supported in Databricks Runtime 8.2 (Unsupported) and above.

The rescued data column ensures that you never lose or miss out on data during ETL. The rescued data column contains any data that wasn’t parsed, either because it was missing from the given schema, or because there was a type mismatch, or because the casing of the column in the record or file didn’t match with that in the schema. The rescued data column is returned as a JSON blob containing the columns that were rescued, and the source file path of the record (the source file path is available in Databricks Runtime 8.3 and above). To remove the source file path from the rescued data column, you can set the SQL configuration spark.conf.set("spark.databricks.sql.rescuedDataColumn.filePath.enabled", "false"). You can enable the rescued data column by setting the option rescuedDataColumn to a column name, such as _rescued_data with spark.read.option("rescuedDataColumn", "_rescued_data").format("json").load(<path>).

The JSON parser supports three modes when parsing records: PERMISSIVE, DROPMALFORMED, and FAILFAST. When used together with rescuedDataColumn, data type mismatches do not cause records to be dropped in DROPMALFORMED mode or throw an error in FAILFAST mode. Only corrupt records—that is, incomplete or malformed JSON—are dropped or throw errors. If you use the option badRecordsPath when parsing JSON, data type mismatches are not considered as bad records when using the rescuedDataColumn. Only incomplete and malformed JSON records are stored in badRecordsPath.


Single-line mode

In this example, there is one JSON object per line:

{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}
{"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}

To read the JSON data, use:

val df = spark.read.format("json").load("example.json")

Spark infers the schema automatically.

 |-- array: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- dict: struct (nullable = true)
 |    |-- extra_key: string (nullable = true)
 |    |-- key: string (nullable = true)
 |-- int: long (nullable = true)
 |-- string: string (nullable = true)

Multi-line mode

This JSON object occupies multiple lines:

  {"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}},
  {"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}},
    "string": "string3",
    "int": 3,
    "array": [
    "dict": {
        "key": "value3",
        "extra_key": "extra_value3"

To read this object, enable multi-line mode:

USING json
OPTIONS (path="/tmp/multi-line.json",multiline=true)
val mdf = spark.read.option("multiline", "true").format("json").load("/tmp/multi-line.json")

Charset auto-detection

By default, the charset of input files is detected automatically. You can specify the charset explicitly using the charset option:

spark.read.option("charset", "UTF-16BE").format("json").load("fileInUTF16.json")

Some supported charsets include: UTF-8, UTF-16BE, UTF-16LE, UTF-16, UTF-32BE, UTF-32LE, UTF-32. For the full list of charsets supported by Oracle Java SE, see Supported Encodings.


The following notebook demonstrates single line and multi-line mode.

Read JSON files notebook

Open notebook in new tab