Load data using a Unity Catalog external location

Preview

This feature is in Public Preview.

This article describes how to use the add data UI to create a managed table from data in Google Cloud Storage using a Unity Catalog external location. An external location is an object that combines a cloud storage path with a storage credential that authorizes access to the cloud storage path.

Before you begin

Before you begin, you must have the following:

File types

The following file types are supported:

  • CSV

  • TSV

  • JSON

  • XML

  • AVRO

  • Parquet

Step 1: Confirm access to the external location

To confirm access to the external location, do the following:

  1. In the sidebar of your Databricks workspace, click Catalog.

  2. In Catalog Explorer, click External Data > External Locations.

Step 2: Create the managed table

To create the managed table, do the following:

  1. In the sidebar of your workspace, click + New > Add data.

  2. In the add data UI, click Google Cloud Storage.

  3. Select an external location from the drop-down list.

  4. Select the folders and the files that you want to load into Databricks, and then click Preview table.

  5. Select a catalog and a schema from the drop-down lists.

  6. (Optional) Edit the table name.

  7. (Optional) To set advanced format options by file type, click Advanced attributes, turn off Automatically detect file type, and then select a file type.

    For a list of format options, see the following section.

  8. (Optional) To edit the column name, click the input box at the top of the column.

    Column names don’t support commas, backslashes, or unicode characters (such as emojis).

  9. (Optional) To edit column types, click the icon with the type.

  10. Click Create table.

File type format options

The following format options are available, depending on the file type:

Format option

Description

Supported file types

Column delimiter

The separator character between columns. Only a single character is allowed, and backslash is not supported.

The default is a comma.

CSV

Escape character

The escape character to use when parsing the data.

The default is a quotation mark.

CSV

First row contains the header

This option specifies whether the file contains a header.

Enabled by default.

CSV

Automatically detect file type

Automatically detect file type. Default is true.

XML

Automatically detect column types

Automatically detect column types from file content. You can edit types in the preview table. If this is set to false, all column types are inferred as STRING.

Enabled by default.

  • CSV

  • JSON

  • XML

Rows span multiple lines

Whether a column’s value can span multiple lines in the file.

Disabled by default.

  • CSV

  • JSON

Merge the schema across multiple files

Whether to infer the schema across multiple files and to merge the schema of each file.

Enabled by default.

CSV

Allow comments

Whether comments are allowed in the file.

Enabled by default.

JSON

Allow single quotes

Whether single quotes are allowed in the file.

Enabled by default.

JSON

Infer timestamp

Whether to try to infer timestamp strings as TimestampType.

Enabled by default.

JSON

Rescued data column

Whether to save columns that don’t match the schema. For more information, see What is the rescued data column?.

Enabled by default.

  • CSV

  • JSON

  • Avro

  • Parquet

Exclude attribute

Whether to exclude attributes in elements. Default is false.

XML

Attribute prefix

The prefix for attributes to differentiate attributes and elements. Default is _.

XML

Column data types

The following column data types are supported. For more information about individual data types see SQL data types.

Data Type

Description

BIGINT

8-byte signed integer numbers.

BOOLEAN

Boolean (true, false) values.

DATE

and day, without a time-zone.

DECIMAL (P,S)

Numbers with maximum precision P and fixed scale S.

DOUBLE

8-byte double-precision floating point numbers.

STRING

Character string values.

TIMESTAMP

Values comprising values of fields year, month, day, hour, minute, and second, with the session local timezone.

Known issues

  • You might experience issues with special characters in complex data types, such as a JSON object with a key containing a backtick or a colon.

  • Some JSON files might require that you manually select JSON for the file type. To manually select a file type after you select files, click Advanced attributes, turn off Automatically detect file type, and then select JSON.

  • Nested timestamps and decimals inside complex types might encounter issues.