Advanced | Vocabulary | #apache-spark #streaming #data-engineering #real-time #big-data

Apache Spark Structured Streaming: Vocabulary

Spark Structured Streaming treats a stream as an unbounded table queried through the DataFrame API, so most of the Spark SQL API carries over to real-time data processing (a few operations are not supported on streaming DataFrames). Understanding watermarks, output modes, triggers, and checkpointing is critical for running streaming pipelines in production.

1 / 5

What is the key difference between Spark's Structured Streaming and the older DStream API?

2 / 5

A Spark Structured Streaming job processes late-arriving events. The developer sets .withWatermark('event_time', '10 minutes'). What does this configuration do?
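
To make the watermark setting in this question concrete, here is a minimal plain-Python model of the rule, not Spark code. It assumes the watermark is the maximum event time seen so far minus the delay, and that it advances only after a micro-batch finishes, so it applies from the next batch onward. The function name `apply_watermark` and the sample timestamps are hypothetical. The model is also stricter than Spark: Spark guarantees that rows within the threshold are kept, but rows older than the watermark are only eligible to be dropped, and for windowed aggregations the comparison is made against the window rather than the raw event time.

```python
from datetime import datetime, timedelta


def apply_watermark(batches, delay):
    """Simplified model of watermark filtering across micro-batches.

    Returns (kept, dropped) lists of event times. The watermark is
    max(event time seen so far) - delay and is updated only after a
    batch has been processed, so it takes effect in the next batch.
    """
    watermark = None  # no watermark exists until some data has been seen
    max_seen = None
    kept, dropped = [], []
    for batch in batches:
        for event_time in batch:
            if watermark is not None and event_time < watermark:
                dropped.append(event_time)  # older than the watermark: too late
            else:
                kept.append(event_time)
        if batch:
            batch_max = max(batch)
            max_seen = batch_max if max_seen is None else max(max_seen, batch_max)
            watermark = max_seen - delay  # advance for the next batch
    return kept, dropped


# Hypothetical event times, expressed as offsets from 12:00.
t0 = datetime(2024, 1, 1, 12, 0)
minutes = lambda m: t0 + timedelta(minutes=m)

batches = [
    [minutes(0), minutes(20)],   # batch 1: watermark becomes 12:10 afterwards
    [minutes(15), minutes(5)],   # batch 2: 12:15 is in time, 12:05 is too late
]
kept, dropped = apply_watermark(batches, delay=timedelta(minutes=10))
```

In this run the 12:05 event arrives after the watermark has moved to 12:10, so it is the only one the model discards. The 12:15 event is late relative to 12:20 but still inside the 10-minute allowance.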

3 / 5

What does the Complete output mode mean in Spark Structured Streaming?
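
As a rough illustration of the output mode in this question, the sketch below models a running count per key in plain Python, not Spark, and records what would be written to the sink after each micro-batch. It assumes Complete mode rewrites the whole result table on every trigger, while Update mode writes only the rows that changed in that batch. The function name `emit` and the sample keys are hypothetical. Append mode is left out because it needs a watermark to decide when an aggregate row is final.

```python
from collections import Counter


def emit(batches, mode):
    """Return what a running count-by-key would write after each micro-batch."""
    table = Counter()  # the full result table, kept as state across batches
    outputs = []
    for batch in batches:
        changed = set(batch)  # keys whose counts change in this batch
        table.update(batch)
        if mode == "complete":
            outputs.append(dict(table))  # the entire result table, every time
        elif mode == "update":
            outputs.append({k: table[k] for k in sorted(changed)})  # changed rows only
        else:
            raise ValueError(f"unsupported mode in this sketch: {mode}")
    return outputs


batches = [["a", "b"], ["a"]]
complete_out = emit(batches, "complete")
update_out = emit(batches, "update")
```

After the second batch, the Complete output still contains the unchanged row for `b`, whereas the Update output contains only the new count for `a`. This is why Complete mode requires the engine to retain the full aggregate state.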

4 / 5

A streaming job uses .trigger(availableNow=True). How does this differ from .trigger(once=True)?

5 / 5

What is the purpose of the checkpoint location in a Spark Structured Streaming query?