Spark Structured Streaming treats streams as unbounded DataFrames, enabling the full Spark SQL API for real-time data processing. Understanding watermarks, output modes, triggers, and checkpointing is critical for production streaming pipelines.
0 / 5 completed
1 / 5
What is the key difference between Spark's Structured Streaming and the older DStream API?
Structured Streaming models a stream as an unbounded table that continuously grows. Engineers write the same DataFrame/SQL transformations used for batch processing, and Spark handles incremental execution. DStreams used RDD-based microbatches with a separate API, requiring different code for batch vs streaming and lacking SQL optimization.
2 / 5
A Spark Structured Streaming job processes late-arriving events. The developer sets .withWatermark('event_time', '10 minutes'). What does this configuration do?
Watermarking tells Spark the maximum expected lateness of events. With withWatermark('event_time', '10 minutes'), Spark maintains window state until the watermark (max event_time seen - 10 min) passes the window's end time. Events arriving within the 10-minute late threshold are included; later events are dropped. This enables bounded state for streaming aggregations.
3 / 5
What does the Complete output mode mean in Spark Structured Streaming?
In Complete mode, Spark outputs the entire aggregation result table after each trigger — every row, updated. This is only valid with aggregation queries where the full result must be materialized. Append mode outputs only new rows; Update mode outputs only changed rows. Complete mode requires storing the full state in memory.
4 / 5
A streaming job uses .trigger(availableNow=True). How does this differ from .trigger(once=True)?
trigger(once=True) processes all available data in one micro-batch then terminates — problematic for large data volumes. trigger(availableNow=True) (Spark 3.3+) processes all available data in multiple optimally-sized micro-batches then terminates. This is preferred for batch-style incremental processing as it better utilizes cluster resources.
5 / 5
What is the purpose of the checkpoint location in a Spark Structured Streaming query?
The checkpoint location stores two things: source offsets (tracking which data has been processed, enabling resume without reprocessing) and streaming state (e.g., running aggregation values for stateful operations). Without checkpointing, a restart reprocesses from the beginning. With it, Spark achieves exactly-once semantics when combined with idempotent sinks.