Watermarks and Windows — Stream Processing Vocabulary
Learn vocabulary for event time vs. processing time, watermarks, tumbling windows, sliding windows, session windows, and late data handling.
1 / 5
What is the difference between 'event time' and 'processing time' in stream processing vocabulary?
Event time is when the event actually occurred; it is embedded in the record payload (e.g., a sensor timestamp). Processing time is when the engine processed the event, read from the engine's wall clock. The two diverge because of network delays, retries, or backfill. Event-time semantics assign each record to the period in which it occurred, so an aggregate such as revenue per hour does not depend on when the data arrives. Processing-time semantics are simpler but assign late or out-of-order records to the wrong period.
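The difference can be made concrete with a small sketch. The records, field layout, and the `revenue_per_hour` helper below are hypothetical, chosen only to illustrate the idea: the same three records are bucketed by hour twice, once by event time and once by processing time.

```python
from collections import defaultdict

# Hypothetical records: (event_time_s, processing_time_s, amount).
# Times are seconds since an arbitrary start; one hour is 3600 s.
RECORDS = [
    (3500, 3510, 10.0),  # occurred and arrived in hour 0
    (3590, 3720, 5.0),   # occurred in hour 0, arrived in hour 1 (delayed)
    (3700, 3705, 7.0),   # occurred and arrived in hour 1
]


def revenue_per_hour(records, time_index):
    """Sum amounts per hour, bucketing by the chosen timestamp column."""
    totals = defaultdict(float)
    for record in records:
        hour = record[time_index] // 3600
        totals[hour] += record[2]
    return dict(totals)


by_event_time = revenue_per_hour(RECORDS, 0)       # {0: 15.0, 1: 7.0}
by_processing_time = revenue_per_hour(RECORDS, 1)  # {0: 10.0, 1: 12.0}
```

The delayed record moves 5.0 of revenue from hour 0 to hour 1 under processing time, while the event-time totals are unaffected by the delay.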
2 / 5
What is a 'watermark' in stream processing vocabulary?
A watermark (used in Flink, Beam, and Dataflow) is the engine's estimate of event-time progress: a watermark W(t) means 'I believe all events with event_time < t have been received.' When the watermark passes a window's end time, the engine closes the window and emits results. Watermarks are heuristic, so some events can still arrive behind the watermark; these are treated as late data. A watermark that advances too aggressively closes windows before all their data has arrived and yields incomplete results; one that is too conservative delays results and increases latency.
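One common heuristic is to set the watermark to the largest event time seen so far minus a fixed delay bound. The sketch below is a simplified, engine-independent illustration of that idea, not the implementation of any particular engine; the class name, the `run_window` helper, and the sample event times are all assumptions.

```python
class BoundedDelayWatermark:
    """Watermark = max event time seen so far minus a fixed delay bound."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_event_time = None

    def observe(self, event_time):
        """Record an event and return the updated watermark."""
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return self.current()

    def current(self):
        if self.max_event_time is None:
            return float("-inf")
        return self.max_event_time - self.max_delay


def run_window(event_times, window_end, max_delay):
    """Split events of the window [0, window_end) into on-time and late.

    Assumption: the window closes once the watermark reaches window_end.
    """
    watermark = BoundedDelayWatermark(max_delay)
    on_time, late = [], []
    closed = False
    for t in event_times:
        if t < window_end:
            (late if closed else on_time).append(t)
        if watermark.observe(t) >= window_end:
            closed = True
    return on_time, late


ARRIVALS = [1, 4, 9, 12, 7, 15, 8]  # event times in arrival order
conservative = run_window(ARRIVALS, window_end=10, max_delay=3)
aggressive = run_window(ARRIVALS, window_end=10, max_delay=0)
```

With a delay bound of 3 the window stays open long enough to include the out-of-order event at time 7; with a bound of 0 it closes as soon as the event at time 12 arrives, so both 7 and 8 are late. The larger bound includes more data but emits the result later.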
3 / 5
What is a 'tumbling window' in stream processing vocabulary?
Tumbling windows: fixed duration, no overlap. Every event belongs to exactly one window. Ideal for aggregations like 'total orders in each 1-hour period.' Simple to reason about and compute. Contrast with sliding windows (overlap) and session windows (dynamic, gap-based). Common in billing, reporting, and rate calculations.
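Window assignment reduces to simple arithmetic on the event timestamp. The following is a minimal sketch with hypothetical helper names, assuming integer timestamps and windows aligned to time zero; it also shows the sliding-window contrast, where one event falls into several overlapping windows.

```python
def tumbling_window(event_time, size):
    """Return the single [start, end) window that contains event_time."""
    start = event_time - (event_time % size)
    return (start, start + size)


def sliding_windows(event_time, size, slide):
    """Return every [start, end) window that contains event_time.

    Windows start at multiples of `slide` and last `size` units, so an
    event belongs to roughly size / slide overlapping windows.
    """
    last_start = event_time - (event_time % slide)
    starts = range(last_start, event_time - size, -slide)
    return [(start, start + size) for start in starts]


# Timestamps in minutes: an event at minute 125.
hourly = tumbling_window(125, size=60)                  # (120, 180)
overlapping = sliding_windows(125, size=60, slide=30)   # [(120, 180), (90, 150)]
```

The tumbling case always yields exactly one window, which is why per-window aggregates can be summed without double counting; the sliding case would count the same event once per overlapping window.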
4 / 5
What is a 'session window' in stream processing vocabulary?
Session windows group all events from a user or device into a 'session' bounded by periods of inactivity, which makes them central to behavioral analytics. If a user clicks at 2:00, 2:05, 2:08, and then not again until 3:00, and the gap timeout is 15 minutes, the first three clicks form one session [2:00-2:08] and the 3:00 click starts a new one. Sessions have variable duration, unlike tumbling or sliding windows. Useful for clickstream analysis and user journey aggregation.
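The click example above can be reproduced with a short gap-based grouping sketch. The `sessionize` helper is hypothetical and works on a complete, sorted batch for one user; a streaming engine would instead merge sessions incrementally as events arrive. Times are minutes since midnight, and a new session is assumed to start when the gap since the previous event reaches the timeout.

```python
def sessionize(event_times, gap):
    """Group event times into sessions split by inactivity of at least `gap`.

    Returns a list of (first_event, last_event) pairs.
    """
    sessions = []
    for t in sorted(event_times):
        if sessions and t - sessions[-1][1] < gap:
            sessions[-1][1] = t      # extend the current session
        else:
            sessions.append([t, t])  # start a new session
    return [tuple(session) for session in sessions]


# Clicks at 2:00, 2:05, 2:08 and 3:00, as minutes since midnight.
clicks = [120, 125, 128, 180]
sessions = sessionize(clicks, gap=15)  # [(120, 128), (180, 180)]
```

Some engines report the session end as the last event time plus the gap rather than the last event time itself; the grouping of events is the same either way.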
5 / 5
What is 'late data handling' in stream processing vocabulary?
Late data is inevitable: mobile apps buffer events offline, network delays vary, and clock skew exists. Strategies: (1) Allowed lateness: keep the window's state for an extra period after the watermark passes its end, and emit updated results when late events arrive within that period. (2) Side outputs / dead-letter: route late events to a separate stream for reprocessing or alerting. (3) Drop: acceptable when late data is rare and approximate results suffice. The choice depends on correctness requirements.
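The first two strategies can be combined into one routing decision per event. This is a simplified sketch under stated assumptions: tumbling windows aligned to time zero, a hypothetical `route_event` helper, and made-up label strings. Replacing the final branch with a drop gives strategy (3).

```python
def route_event(event_time, watermark, window_size, allowed_lateness):
    """Decide how to handle an event given the current watermark.

    Assumes tumbling windows of `window_size` aligned to time zero.
    """
    window_end = event_time - (event_time % window_size) + window_size
    if watermark < window_end:
        return "on_time"       # window still open: aggregate normally
    if watermark < window_end + allowed_lateness:
        return "late_update"   # window state kept: emit an updated result
    return "side_output"       # too late: send to a separate stream


# Event at minute 55 belongs to window [0, 60); allowed lateness is 30.
first = route_event(55, watermark=50, window_size=60, allowed_lateness=30)
second = route_event(55, watermark=70, window_size=60, allowed_lateness=30)
third = route_event(55, watermark=95, window_size=60, allowed_lateness=30)
```

Here the same event is on time at watermark 50, triggers an updated result at watermark 70, and is routed to the side output at watermark 95, after the allowed lateness has expired.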