What is Apache Spark Structured Streaming?

Apache Spark Structured Streaming is a near real-time processing engine that offers end-to-end fault tolerance with exactly-once processing guarantees using familiar Spark APIs. Structured Streaming lets you express computation on streaming data in the same way you express a batch computation on static data. The Structured Streaming engine runs the computation incrementally and continuously updates the result as streaming data arrives. For an overview of Structured Streaming, see the Apache Spark Structured Streaming Programming Guide.
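As a minimal sketch of "the same computation for batch and streaming", the following Python functions apply one transformation to a static DataFrame and to a streaming DataFrame. The column names (`time`, `action`), the JSON input format, the `path` argument, and the query name `action_counts` are hypothetical and chosen only for illustration; `spark` is assumed to be an existing SparkSession.

```python
def count_by_action(events):
    """Count events per action. Works on a static or a streaming DataFrame."""
    return events.groupBy("action").count()


def run_batch(spark, path):
    # Batch: read the JSON files once and return the aggregated DataFrame.
    events = spark.read.schema("time TIMESTAMP, action STRING").json(path)
    return count_by_action(events)


def run_streaming(spark, path):
    # Streaming: the same transformation, applied to files as they arrive.
    # File-based streaming sources need an explicit schema.
    events = spark.readStream.schema("time TIMESTAMP, action STRING").json(path)
    return (
        count_by_action(events)
        .writeStream
        .outputMode("complete")      # rewrite the full result table on each update
        .format("memory")            # in-memory sink, intended for experiments only
        .queryName("action_counts")  # hypothetical name of the in-memory table
        .start()
    )
```

Only the reader (`read` versus `readStream`) and the writer differ; the aggregation in `count_by_action` is shared by both paths.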

For information about using Structured Streaming with Unity Catalog, see Using Unity Catalog with Structured Streaming.

How is Structured Streaming used on Databricks?

Structured Streaming integrates tightly with Delta Lake to offer enhanced functionality for incremental data processing at scale in the Databricks Lakehouse. Structured Streaming is the core technology behind both Databricks Auto Loader and Delta Live Tables.

What streaming sources and sinks does Databricks support?

Databricks recommends using Auto Loader to ingest supported file types from cloud object storage into Delta Lake. For ETL pipelines, Databricks recommends using Delta Live Tables (which uses Delta tables and Structured Streaming). You can also configure incremental ETL workloads by streaming to and from Delta Lake tables.
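The Auto Loader recommendation above might look roughly like the following sketch. It assumes a Databricks runtime (the `cloudFiles` source is not part of open-source Apache Spark), JSON input files, and hypothetical values for the source path, schema location, checkpoint location, and target table name, all passed in by the caller.

```python
def ingest_with_auto_loader(spark, source_path, schema_path, checkpoint_path, target_table):
    """Incrementally load files from cloud object storage into a Delta table."""
    raw = (
        spark.readStream
        .format("cloudFiles")                              # Auto Loader source
        .option("cloudFiles.format", "json")               # assumed input file format
        .option("cloudFiles.schemaLocation", schema_path)  # where the inferred schema is tracked
        .load(source_path)
    )
    return (
        raw.writeStream
        .option("checkpointLocation", checkpoint_path)  # progress tracking for fault tolerance
        .trigger(availableNow=True)                     # process what is available, then stop
        .toTable(target_table)                          # a Delta table on Databricks
    )
```

With `availableNow=True` the query behaves like an incremental batch job that can be scheduled; omitting the trigger call would keep the stream running continuously.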

In addition to Delta Lake and Auto Loader, Structured Streaming can connect to messaging services such as Apache Kafka.
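For illustration, a Kafka source could be read along these lines. The broker address and topic name are hypothetical arguments supplied by the caller, and `spark` is assumed to be an existing SparkSession with the Kafka connector available.

```python
def read_kafka_topic(spark, bootstrap_servers, topic):
    """Return a streaming DataFrame with the key and value of each Kafka record as strings."""
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)  # e.g. "host1:9092,host2:9092"
        .option("subscribe", topic)
        .load()
        # Kafka delivers key and value as binary, so cast them before further processing.
        .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
    )
```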

You can also use foreachBatch to write to arbitrary data sinks.
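A minimal sketch of the foreachBatch pattern, assuming a JDBC database as the target sink: the JDBC URL, table name, and checkpoint path are hypothetical, and `stream_df` is assumed to be an existing streaming DataFrame.

```python
def make_jdbc_writer(jdbc_url, table):
    """Build a function that foreachBatch calls once per micro-batch."""
    def write_batch(batch_df, batch_id):
        # batch_df is an ordinary (non-streaming) DataFrame, so any batch writer works here.
        # batch_id identifies the micro-batch and can be used to deduplicate on retries.
        (
            batch_df.write
            .format("jdbc")
            .option("url", jdbc_url)
            .option("dbtable", table)
            .mode("append")
            .save()
        )
    return write_batch


def start_jdbc_sink(stream_df, checkpoint_path, jdbc_url, table):
    # Route every micro-batch of the stream through the batch writer above.
    return (
        stream_df.writeStream
        .foreachBatch(make_jdbc_writer(jdbc_url, table))
        .option("checkpointLocation", checkpoint_path)
        .start()
    )
```

By default, foreachBatch provides at-least-once write guarantees, so a sink that must avoid duplicates needs idempotent writes or deduplication keyed on `batch_id`.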

What are best practices for Structured Streaming in production?

Databricks supports a number of features not found in Apache Spark to help customers get the best performance out of Structured Streaming. To learn more about these features and other recommendations, see Production considerations for Structured Streaming.

Examples

For introductory notebooks and notebooks demonstrating example use cases, see Structured Streaming patterns on Databricks.

API reference

For reference information about Structured Streaming, Databricks recommends the following Apache Spark API reference: