Ingest data into a Databricks lakehouse

Databricks offers various methods for ingesting data into a lakehouse backed by Delta Lake. This article lists supported ingestion tools and guidance on which method to use based on criteria like data source and latency.

Ingestion methods

You can ingest data into Databricks using the following methods:

Batch ingestion of a set of data rows for infrequent processing
Streaming ingestion of individual data rows or sets of data rows as they arrive for real-time processing

Ingested data is loaded into Delta tables that can then be used across your downstream data and AI use cases. Because of Databricks’ Lakehouse architecture, you do not need to duplicate your data across use cases, and you can leverage Unity Catalog for centralized access control, auditing, lineage, and data discovery across all of your data.

Batch ingestion

With batch ingestion you load data as sets (or batches) of rows into Databricks often based on a schedule (for example, every day) or triggered manually. This represents the “extract” piece of traditional extract, transform, load (ETL) use cases. You can use batch ingestion to load data from:

Local files like CSVs
Cloud object storage, including Amazon S3, Azure Data Lake Storage, and Google Cloud Storage

Batch ingestion supports a wide range of file source formats, including CSV, TSV, JSON, XML, Avro, ORC, Parquet, and text files.

Databricks supports both traditional batch ingestion and incremental batch ingestion options. While traditional batch ingestion processes all records each time it runs, incremental batch ingestion automatically detects new records in the data source and ignores records that have already been ingested. This means less data needs to be processed and, as result, ingestion jobs run faster and use compute resources more efficiently.

Traditional (one-time) batch ingestion

You can upload local data files or download files from a public URL using the add data UI. See Upload files.

Incremental batch ingestion

This section describes supported incremental batch ingestion tools.

Streaming tables

The CREATE STREAMING TABLE SQL command lets you load data incrementally into streaming tables from cloud object storage. See CREATE STREAMING TABLE.

Example: Incremental batch ingestion using streaming tables

CREATE OR REFRESH STREAMING TABLE customers
AS SELECT * FROM read_files("/databricks-datasets/retail-org/customers/", "csv")

Cloud object storage connector

Auto Loader, the built-in cloud object storage connector, allows you to incrementally and efficiently process new data files as they arrive in Amazon S3 (S3), Azure Data Lake Storage Gen 2 (ALDS2), or Google Cloud Storage (GCS). See Auto Loader.

Example: Incremental batch ingestion using Auto Loader

df = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("rescuedDataColumn", "_rescued_data")
  .schema("/databricks-datasets/retail-org/customers/schema")
  .load("/databricks-datasets/retail-org/customers/")

Streaming ingestion

With streaming ingestion you continuously load data rows or batches of data rows as it is generated so you can query it as it arrives in near real-time. You can use streaming ingestion to load streaming data from sources like Apache Kafka, Amazon Kinesis, Google Pub/Sub, and Apache Pulsar.

Databricks also supports streaming ingestion using built-in connectors. These connectors allow you to incrementally and efficiently process new data as it arrives from streaming sources. See Configure streaming data sources.

Example: Streaming ingestion from Kafka

spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<server:ip>")
    .option("subscribe", "topic1")
    .option("startingOffsets", "latest")
    .load()

Batch and streaming ingestion with Delta Live Tables

Databricks recommends using Delta Live Tables to build reliable and scalable data processing pipelines. Delta Live Tables supports both batch and streaming ingestion, and you can ingest data from any data source supported by Auto Loader.

Example: Incremental batch ingestion using Delta Live Tables

@dlt.table
def customers():
  return (
    spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .load("/databricks-datasets/retail-org/customers/")
  )

Example: Streaming ingestion from Kafka using Delta Live Tables

@dlt.table
def kafka_raw():
  return (
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "<server:ip>")
      .option("subscribe", "topic1")
      .option("startingOffsets", "latest")
      .load()
  )

Ingestion schedules

You can ingest data as a one-time operation, on a recurring schedule, or continuously.

For near real-time streaming use cases, use continuous mode.
For batch ingestion use cases, ingest one time or set a recurring schedule.

See Ingestion with Jobs and Triggered vs. continuous pipeline mode.

Ingestion partners

Many third-party tools support batch or streaming ingestion into Databricks. Databricks validates various third-party integrations, although the steps to configure access to source systems and ingest data vary by tool. See Ingestion partners for a list of validated tools. Some technology partners are also featured in Databricks Partner Connect, which provides a UI that simplifies connecting third-party tools to lakehouse data.

DIY ingestion

Databricks provides a general compute platform. As a result, you can create your own ingestion connectors using any programming language supported by Databricks, like Python or Java. You can also import and leverage popular open source connector libraries like data load tool, Airbyte, and Debezium.

Ingestion alternatives

Databricks recommends ingestion for most use cases because it scales to accommodate high data volumes, low-latency querying, and third-party API limits. Ingestion copies data from your source systems to Databricks, which results in duplicate data that might become stale over time. If you don’t want to copy data, you can use the following tools:

Lakehouse Federation allows you to query external data sources without moving your data.
Delta Sharing allows you to securely share data across platforms, clouds, and regions.

However, if you don’t want to copy data, use Lakehouse Federation or Delta Sharing.

When to use Lakehouse Federation

Choose Lakehouse Federation for the following scenarios:

Ad hoc reporting or proof-of-concept work on your ETL pipelines

Migrate data to Delta Lake

To learn how to migrate existing data to Delta Lake, see Migrate data to Delta Lake.

COPY INTO (legacy)

The CREATE STREAMING TABLE SQL command is the recommended alternative to the legacy COPY INTO SQL command for incremental ingestion from cloud object storage. See COPY INTO. For a more scalable and robust file ingestion experience, Databricks recommends that SQL users leverage streaming tables instead of COPY INTO.

Additional resources

What is the difference between streaming and incremental batch ingestion?