Structured Streaming concepts
October 02, 2024
This article provides an introduction to Structured Streaming on Databricks.
What is Structured Streaming?
Apache Spark Structured Streaming is a near real-time processing engine that offers end-to-end fault tolerance with exactly-once processing guarantees using familiar Spark APIs. Structured Streaming lets you express computation on streaming data in the same way you express a batch computation on static data. The Structured Streaming engine performs the computation incrementally and continuously updates the result as streaming data arrives.
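As a rough illustration of this idea, the following sketch applies one aggregation to both a static read and a streaming read of the same table. The table name, the `action` column, and the helper function names are hypothetical, and `spark` is assumed to be an active SparkSession.

```python
def count_by_action(events_df):
    """Count rows per action. Accepts a batch or a streaming DataFrame."""
    # "action" is a hypothetical column name used only for illustration.
    return events_df.groupBy("action").count()


def batch_and_streaming_counts(spark, table_name):
    """Apply the same transformation to a static and a streaming read."""
    # Batch: read the table once as a static DataFrame.
    batch_counts = count_by_action(spark.read.table(table_name))
    # Streaming: read the same table incrementally as new data arrives.
    streaming_counts = count_by_action(spark.readStream.table(table_name))
    return batch_counts, streaming_counts
```

The streaming DataFrame returned here only describes the computation. No data is processed until a streaming write is started on it.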
Read from a data stream
You can use Structured Streaming to incrementally ingest data from supported data sources. Common data sources include the following:
Data files in cloud object storage. See What is Auto Loader?.
Message buses and queues. See Configure streaming data sources.
Delta Lake. See Delta table streaming reads and writes.
Each data source provides options that control how batches of data are loaded. When you configure a reader, you might need to set options to do the following:
Specify the data source or format (for example, file type, delimiters, and schema).
Configure access to source systems (for example, port settings and credentials).
Specify where to start in a stream (for example, Kafka offsets or reading all existing files).
Control how much data is processed in each batch (for example, max offsets, files, or bytes per batch). See Configure Structured Streaming batch size on Databricks.
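The reader options listed above might look like the following for a Kafka source. This is a minimal sketch, not a recommended configuration: the broker address, topic name, default batch limit, and helper function names are assumptions for illustration, and `spark` is assumed to be an active SparkSession.

```python
def kafka_reader_options(bootstrap_servers, topic, max_offsets_per_trigger=10000):
    """Build reader options for a Kafka source (all values are illustrative)."""
    return {
        # Access to the source system.
        "kafka.bootstrap.servers": bootstrap_servers,
        # Which topic to read.
        "subscribe": topic,
        # Where to start in the stream: "earliest" reads from the oldest
        # available offsets when the query first starts.
        "startingOffsets": "earliest",
        # Upper bound on the number of offsets processed per batch.
        "maxOffsetsPerTrigger": str(max_offsets_per_trigger),
    }


def read_kafka_stream(spark, options):
    """Return a streaming DataFrame for the given Kafka reader options."""
    return spark.readStream.format("kafka").options(**options).load()
```

For example, `read_kafka_stream(spark, kafka_reader_options("broker-1:9092", "events"))` would return a streaming DataFrame, where the broker and topic are placeholder values.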
Write to a data sink
A data sink is the target of a streaming write operation. Common sinks used in Databricks streaming workloads include the following:
Delta Lake
Message buses and queues
Key-value databases
As with data sources, most data sinks provide options that control how data is written to the target system. When you configure a writer, you typically specify the following:
Output mode (append by default). See Select an output mode for Structured Streaming.
A checkpoint location (required for each writer). See Structured Streaming checkpoints.
Trigger intervals. See Configure Structured Streaming trigger intervals.
Options that specify the data sink or format (for example, file type, delimiters, and schema).
Options that configure access to target systems (for example, port settings and credentials).
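The writer settings above can be combined as in the following sketch, which appends a streaming DataFrame to a Delta table. The function name, the one-minute trigger interval, and any table name or checkpoint path you pass in are assumptions for illustration. `stream_df` is assumed to be a streaming DataFrame.

```python
def start_delta_append(stream_df, table_name, checkpoint_path):
    """Start a streaming write that appends to a Delta table."""
    return (
        stream_df.writeStream
        .format("delta")                                # data sink format
        .outputMode("append")                           # the default, shown explicitly
        .option("checkpointLocation", checkpoint_path)  # required for each writer
        .trigger(processingTime="1 minute")             # illustrative trigger interval
        .toTable(table_name)                            # starts the streaming query
    )
```

The call to `toTable` starts the query and returns a handle that you can use to monitor or stop it.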