Structured Streaming concepts

October 02, 2024

This article provides an introduction to Structured Streaming on Databricks.

What is Structured Streaming?

Apache Spark Structured Streaming is a near real-time processing engine that offers end-to-end fault tolerance with exactly-once processing guarantees using familiar Spark APIs. Structured Streaming lets you express computation on streaming data in the same way you express a batch computation on static data. The Structured Streaming engine performs the computation incrementally and continuously updates the result as streaming data arrives.

Read from a data stream

You can use Structured Streaming to incrementally ingest data from supported data sources. Common data sources include the following:

Each data source provides a number of options to specify how to load batches of data. During reader configuration, you might need to configure options to do the following:

  • Specify the data source or format (for example, file type, delimiters, and schema).

  • Configure access to source systems (for example, port settings and credentials).

  • Specify where to start in a stream (for example, Kafka offsets or reading all existing files).

  • Control how much data is processed in each batch (for example, max offsets, files, or bytes per batch). See Configure Structured Streaming batch size on Databricks.

Write to a data sink

A data sink is the target of a streaming write operation. Common sinks used in Databricks streaming workloads include the following:

  • Delta Lake

  • Message buses and queues

  • Key-value databases

As with data sources, most data sinks provide a number of options to control how data is written to the target system. During writer configuration, you specify the following options: