Streaming on Databricks

You can use Databricks for near real-time ingestion and processing of streaming data, and for machine learning and AI on that data.

Databricks offers numerous optimizations for streaming and incremental processing. For most streaming or incremental data processing or ETL tasks, Databricks recommends Delta Live Tables. See What is Delta Live Tables?.

Most incremental and streaming workloads on Databricks are powered by Structured Streaming, including Delta Live Tables and Auto Loader. See What is Auto Loader?.

Delta Lake and Structured Streaming have tight integration to power incremental processing in the Databricks lakehouse. See Delta table streaming reads and writes.

To learn more about building streaming solutions on the Databricks platform, see the data streaming product page.

Databricks has specific features for working with semi-structured data fields contained in Avro, protocol buffers, and JSON data payloads.

What is Structured Streaming?

Apache Spark Structured Streaming is a near real-time processing engine that offers end-to-end fault tolerance with exactly-once processing guarantees using familiar Spark APIs. Structured Streaming lets you express computation on streaming data in the same way you express a batch computation on static data. The Structured Streaming engine performs the computation incrementally and continuously updates the result as streaming data arrives.
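
As a rough illustration of expressing a streaming computation the same way as a batch one, the sketch below applies one aggregation function to both a static and a streaming DataFrame. The names `events_per_bucket`, `run_demo`, the `bucket` column, and the `bucket_counts` query name are hypothetical and chosen only for this example. It assumes a live `SparkSession` is passed in, and it uses Spark's built-in `rate` test source and `memory` sink rather than a production source or sink.

```python
def events_per_bucket(df):
    """Count rows per bucket. The same code accepts a static or a streaming DataFrame."""
    return df.groupBy("bucket").count()


def run_demo(spark):
    """Run the aggregation once as a batch job and once as a stream.

    Requires a live SparkSession; pyspark is imported lazily so that
    events_per_bucket can be reused without it.
    """
    from pyspark.sql import functions as F

    # Batch: a static DataFrame with ids 0..99 assigned to ten buckets.
    static_df = spark.range(100).withColumn("bucket", F.col("id") % 10)
    events_per_bucket(static_df).show()

    # Streaming: the built-in "rate" source emits (timestamp, value) rows.
    streaming_df = (
        spark.readStream.format("rate")
        .option("rowsPerSecond", 10)
        .load()
        .withColumn("bucket", F.col("value") % 10)
    )

    # The engine updates the counts incrementally as new rows arrive.
    return (
        events_per_bucket(streaming_df)
        .writeStream.outputMode("complete")
        .format("memory")
        .queryName("bucket_counts")
        .start()
    )
```

Calling `run_demo(spark)` returns a running query; call `stop()` on it when finished.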

If you’re new to Structured Streaming, see Run your first Structured Streaming workload.

For information about using Structured Streaming with Unity Catalog, see Using Unity Catalog with Structured Streaming.

What streaming sources and sinks does Databricks support?

Databricks recommends using Auto Loader to ingest supported file types from cloud object storage into Delta Lake. For ETL pipelines, Databricks recommends using Delta Live Tables (which uses Delta tables and Structured Streaming). You can also configure incremental ETL workloads by streaming to and from Delta Lake tables.
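
A minimal sketch of the Auto Loader to Delta Lake pattern described above might look like the following. The helper names `auto_loader_options` and `ingest_files`, and every path and table name passed to them, are placeholders invented for this example. It assumes JSON input files, a Databricks runtime where the `cloudFiles` source is available, and that storing the inferred schema alongside the checkpoint is acceptable.

```python
def auto_loader_options(file_format: str, schema_location: str) -> dict:
    """Build the reader options for the Auto Loader ("cloudFiles") source."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
    }


def ingest_files(spark, source_path: str, target_table: str, checkpoint_path: str):
    """Incrementally load new files from source_path into a Delta table.

    Requires a live SparkSession on Databricks. All arguments are
    caller-supplied placeholders.
    """
    stream_df = (
        spark.readStream.format("cloudFiles")
        .options(**auto_loader_options("json", checkpoint_path))
        .load(source_path)
    )

    # availableNow processes all files currently present, then stops.
    return (
        stream_df.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)
        .toTable(target_table)
    )
```

The checkpoint location is what lets a rerun pick up only files that arrived since the previous run.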

In addition to Delta Lake and Auto Loader, Structured Streaming can connect to messaging services such as Apache Kafka.
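
For example, reading from Apache Kafka could be sketched as follows. The function names `kafka_source_options` and `read_kafka` are hypothetical, the broker address and topic are supplied by the caller, and the sketch assumes the Kafka connector is available on the cluster and that message keys and values are UTF-8 text.

```python
def kafka_source_options(bootstrap_servers: str, topic: str,
                         starting_offsets: str = "latest") -> dict:
    """Build the reader options for the Structured Streaming Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def read_kafka(spark, bootstrap_servers: str, topic: str):
    """Return a streaming DataFrame of decoded Kafka messages.

    Requires a live SparkSession with the Kafka connector installed.
    """
    raw = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )

    # Kafka delivers key and value as binary; cast them to strings here.
    return raw.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "timestamp",
    )
```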

You can also use foreachBatch to write to arbitrary data sinks.
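
The sketch below shows one possible shape for a `foreachBatch` handler. Because `foreachBatch` by default gives only at-least-once write guarantees, the handler uses the batch ID to skip micro-batches it has already written. The names `make_idempotent_writer`, `write_rows`, and `start_stream` are hypothetical. The sketch keeps its record of seen batch IDs in memory and collects rows to the driver, both of which are simplifications; it assumes the handler runs in the driver's Python process, and a real sink would persist the batch ID together with the data.

```python
from typing import Callable, Set


def make_idempotent_writer(write_rows: Callable[[list], None]) -> Callable:
    """Build a foreachBatch handler that skips micro-batches it has already written."""
    # In-memory only: this record is lost on restart (illustrative simplification).
    seen_batch_ids: Set[int] = set()

    def handle_batch(batch_df, batch_id: int) -> None:
        if batch_id in seen_batch_ids:
            return  # a retried batch; it was written earlier
        # collect() pulls the whole micro-batch to the driver; fine for a demo only.
        write_rows(batch_df.collect())
        seen_batch_ids.add(batch_id)

    return handle_batch


def start_stream(spark, write_rows: Callable[[list], None]):
    """Attach the handler to a test stream. Requires a live SparkSession."""
    stream_df = spark.readStream.format("rate").load()
    return (
        stream_df.writeStream
        .foreachBatch(make_idempotent_writer(write_rows))
        .start()
    )
```

Inside `handle_batch`, `batch_df` is an ordinary DataFrame, so any batch writer can be used in place of `write_rows`.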

Additional resources

For more detail about Structured Streaming, see the Structured Streaming Programming Guide in the Apache Spark documentation.

For reference information about Structured Streaming, Databricks recommends the Apache Spark API references.