Interact with external data on Databricks

Databricks Runtime provides bindings to popular data sources and formats to make importing and exporting data from the lakehouse simple. This article helps you identify formats and integrations that have built-in support, and points to ways to extend Databricks to interact with even more systems. Most data on Databricks lives in cloud object storage. See Where’s my data?.

Databricks provides a number of optimizations for data loading and ingestion.

Databricks also supports query federation. See Run queries using Lakehouse Federation.

If you have not read or written data with Databricks before, consider reviewing the DataFrames tutorial for Python or Scala. Even for users familiar with Apache Spark, this tutorial might address new challenges associated with accessing data in the cloud.

Partner Connect provides optimized, easy-to-configure integrations to many enterprise solutions. See What is Databricks Partner Connect?.

What data formats can you use in Databricks?

Databricks has built-in keyword bindings for all the data formats natively supported by Apache Spark. Databricks uses Delta Lake as the default protocol for reading and writing data and tables, whereas Apache Spark uses Parquet.

The following data formats all have built-in keyword configurations in Apache Spark DataFrames and SQL:

Databricks also provides a custom keyword for loading MLflow experiments.
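As a sketch of how these keyword bindings are used, the format name can be passed to DataFrame readers and writers. The function below assumes `spark` is the SparkSession that Databricks notebooks and jobs provide automatically; the paths and table name are hypothetical.

```python
# Sketch: reading built-in formats with the format keyword, assuming a
# Databricks environment that supplies a SparkSession. Paths and the
# table name are hypothetical placeholders.

def read_and_save_examples(spark):
    # Read a CSV file using the built-in "csv" format keyword.
    csv_df = (
        spark.read.format("csv")
        .option("header", "true")
        .load("/Volumes/main/default/raw/events.csv")
    )
    # Read a JSON file using the built-in "json" format keyword.
    json_df = spark.read.format("json").load("/Volumes/main/default/raw/events.json")
    # On Databricks, saving a table uses Delta Lake by default,
    # so no explicit format keyword is needed here.
    csv_df.write.mode("overwrite").saveAsTable("main.default.events")
    return csv_df, json_df
```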

Work with streaming data sources on Databricks

Databricks can integrate with stream messaging services for near real-time data ingestion into the Databricks lakehouse. Databricks can also sync enriched and transformed data in the lakehouse with other streaming systems.

Structured Streaming provides native streaming access to file formats supported by Apache Spark, but Databricks recommends Auto Loader for most Structured Streaming operations that read data from cloud object storage. See What is Auto Loader?.
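A minimal Auto Loader sketch is shown below. The `cloudFiles` source is Databricks-specific, so this only runs on Databricks; the paths and target table name passed in are hypothetical placeholders you would replace with your own.

```python
# Sketch of an Auto Loader ingestion stream, assuming a Databricks
# environment that provides a SparkSession. All paths and the target
# table name are hypothetical.

def start_autoloader_stream(spark, source_path, schema_path, checkpoint_path, target_table):
    # Auto Loader uses the Databricks-specific "cloudFiles" source to
    # incrementally discover new files in cloud object storage.
    stream_df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", schema_path)
        .load(source_path)
    )
    # Write the stream to a Delta table, processing available files once.
    return (
        stream_df.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)
        .toTable(target_table)
    )
```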

Ingesting streaming messages to Delta Lake lets you retain those messages indefinitely, so you can replay data streams without fear of losing data to retention thresholds.

Databricks has specific features for working with semi-structured data fields contained in Avro, protocol buffers, and JSON data payloads. To learn more, see:

To learn more about specific configurations for streaming from or to message queues, see:

What data sources connect to Databricks with JDBC?

You can use JDBC to connect with many data sources. Databricks Runtime includes drivers for a number of JDBC databases, but you might need to install a driver or different driver version to connect to your preferred database. Supported databases include the following:

You might prefer Lakehouse Federation for managing queries to external database systems. See Run queries using Lakehouse Federation.
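As an illustration, a JDBC read might look like the following sketch. The URL, table, and driver class are hypothetical stand-ins for your own database, and the corresponding JDBC driver must be available on the cluster.

```python
# Sketch of a JDBC read, assuming a Databricks environment that provides
# a SparkSession and a cluster with the JDBC driver installed. The URL,
# table, user, and driver class below are hypothetical.

def read_jdbc_table(spark, user, password):
    return (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://db.example.com:5432/sales")
        .option("dbtable", "public.orders")
        .option("user", user)
        .option("password", password)
        .option("driver", "org.postgresql.Driver")
        .load()
    )
```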

What data services does Databricks integrate with?

The following data services require you to configure connection settings, security credentials, and networking settings. You might need administrator or power user privileges in your Google Cloud account or Databricks workspace. Some also require that you create a Databricks library and install it in a cluster:

You might prefer Lakehouse Federation for managing queries to external database systems. See Run queries using Lakehouse Federation.

Data formats with special considerations

The following data formats may require additional configuration or special considerations for use:

  • Databricks recommends loading images as binary data.

  • XML is not natively supported, but can be used after installing a library.

  • Hive tables are also natively supported by Apache Spark, but require configuration on Databricks.

  • Databricks can directly read many file formats while still compressed. You can also unzip compressed files on Databricks if necessary.

  • LZO requires a codec installation.
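For example, loading images as binary data uses Spark's `binaryFile` reader. The directory path and glob filter below are hypothetical; `spark` is assumed to be the SparkSession a Databricks notebook provides.

```python
# Sketch: loading image files as binary data with the binaryFile reader,
# assuming a Databricks environment with a SparkSession. The path and
# glob pattern are hypothetical.

def read_images_as_binary(spark):
    # Each row contains the file path, modification time, length,
    # and raw bytes of one matching file.
    return (
        spark.read.format("binaryFile")
        .option("pathGlobFilter", "*.png")
        .load("/Volumes/main/default/raw/images/")
    )
```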

For more information about Apache Spark data sources, see Generic Load/Save Functions and Generic File Source Options.