Interact with external data on Databricks

Databricks Runtime provides bindings to popular data sources and formats to make importing data to and exporting data from the lakehouse simple. This article provides information to help you identify formats and integrations that have built-in support. You can also discover ways to extend Databricks to interact with even more systems.

Databricks provides a number of optimizations for data loading and ingestion.

Databricks also supports query federation for both SQL and DataFrame users. See What is query federation?.

If you have not read or written data with Databricks before, consider reviewing the DataFrames tutorial for Python or Scala. Even if you are familiar with Apache Spark, these tutorials address challenges you might not have encountered when accessing data in the cloud.

Partner Connect provides optimized, easy-to-configure integrations to many enterprise solutions. See What is Databricks Partner Connect?.

What data formats can you use in Databricks?

Databricks has built-in keyword bindings for all the data formats natively supported by Apache Spark. Databricks uses Delta Lake as the default protocol for reading and writing data and tables, whereas Apache Spark uses Parquet.

The following data formats all have built-in keyword configurations in Apache Spark DataFrames and SQL:

Databricks also provides a custom keyword for loading MLflow experiments.

Data formats with special considerations

The following data formats may require additional configuration or special consideration for use:

  • Databricks recommends loading images as binary data.

  • XML is not natively supported, but can be used after installing a library.

  • Hive tables are also natively supported by Apache Spark, but require configuration on Databricks.

  • Databricks can directly read many file formats while still compressed. You can also unzip compressed files on Databricks if necessary.

  • LZO requires a codec installation.
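To illustrate the compressed-file point above in plain Python: a gzip-compressed CSV can be parsed without first decompressing it to disk. (Spark applies the same transparency when you point a reader at a `.csv.gz` path; this stdlib sketch just demonstrates the concept outside of Spark.)

```python
# Plain-Python illustration: read a gzip-compressed CSV while still compressed,
# without writing a decompressed copy to disk.
import csv
import gzip
import tempfile

# Create a small gzipped CSV file to work with.
with tempfile.NamedTemporaryFile(suffix=".csv.gz", delete=False) as f:
    gz_path = f.name
with gzip.open(gz_path, "wt", newline="") as gz:
    writer = csv.writer(gz)
    writer.writerows([["id", "action"], ["1", "open"], ["2", "click"]])

# Parse the rows directly from the compressed file.
with gzip.open(gz_path, "rt", newline="") as gz:
    rows = list(csv.reader(gz))

print(rows)  # [['id', 'action'], ['1', 'open'], ['2', 'click']]
```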

For more information about Apache Spark data sources, see Generic Load/Save Functions and Generic File Source Options.

How do you configure cloud object storage for Databricks?

Databricks uses cloud object storage to store data files and tables. During workspace deployment, Databricks configures a cloud object storage location known as the DBFS root. You can configure connections to other cloud object storage locations in your account.

In almost all cases, the data files you interact with using Apache Spark on Databricks are stored in cloud object storage. See the following articles for guidance on configuring connections:

What data sources connect to Databricks with JDBC?

You can use JDBC to connect to many data sources. Databricks Runtime includes drivers for a number of JDBC databases, but you might need to install a driver, or a different driver version, to connect to your preferred database. Supported databases include the following:
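A JDBC connection is typically expressed as a URL plus a set of options. The sketch below builds these in plain Python; the hostname, database, table, and credentials are placeholders, PostgreSQL is used only as an example, and the commented-out `spark.read` call shows the usual pattern on a cluster where the driver is installed. In practice, store credentials in a secret scope rather than in code.

```python
# Hedged sketch: host, port, database, table, and credentials are all
# placeholders; substitute values for your own database.
jdbc_hostname = "db.example.com"  # hypothetical host
jdbc_port = 5432
jdbc_database = "sales"

jdbc_url = f"jdbc:postgresql://{jdbc_hostname}:{jdbc_port}/{jdbc_database}"

connection_options = {
    "url": jdbc_url,
    "dbtable": "public.orders",        # hypothetical table
    "user": "reader",                  # use a secret scope in practice
    "password": "********",
    "driver": "org.postgresql.Driver", # class name from the installed driver
}

# On a cluster with the driver installed, the read follows this pattern:
# df = spark.read.format("jdbc").options(**connection_options).load()
print(jdbc_url)
```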

What data services does Databricks integrate with?

The following data services require you to configure connection settings, security credentials, and networking settings. You might need administrator or power user privileges in your Google Cloud account or Databricks workspace. Some also require that you create a Databricks library and install it in a cluster: