Vocabulary for Stream Processing: Windowing, Watermarks, and Backpressure
Master the English of stream processing: windowing, watermarks, backpressure, exactly-once, late data, and stateful operators. Precise terms for data and platform engineers.
Stream processing (Flink, Kafka Streams, Spark Structured Streaming) has a dense, specialised vocabulary. Words like windowing, watermark, and backpressure either don’t appear in everyday English or mean something quite different there, so there’s little intuition to fall back on. This guide defines the core terms, shows the verb collocations real engineers use, and flags the pronunciation traps. Master these and you’ll discuss streaming pipelines with precision.
Streams vs. batch: the core distinction
A batch job processes a fixed, bounded dataset and then finishes. A streaming job processes an unbounded, continuous flow of events and never finishes.
- Bounded vs. unbounded data
- At rest vs. in motion — data sitting in storage vs. flowing through a pipeline
- Real-time / near-real-time processing
- Throughput (events processed per second) and latency (the delay between an event arriving and its result being emitted)
“We moved the aggregation from a nightly batch job to a streaming pipeline, so dashboards update in near-real-time instead of once a day.”
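The bounded/unbounded distinction can be made concrete with a minimal Python sketch. The function names `batch_total` and `streaming_totals` are hypothetical, chosen only for illustration: the batch version returns one final answer, while the streaming version has to emit an updated answer per event because its input may never end.

```python
from itertools import count, islice

def batch_total(values):
    """Batch: the input is bounded, so one final answer is returned."""
    return sum(values)

def streaming_totals(values):
    """Stream: the input may never end, so emit an updated total per event."""
    total = 0
    for v in values:
        total += v
        yield total

bounded = [1, 2, 3, 4]
unbounded = count(1)  # 1, 2, 3, ... never terminates

final = batch_total(bounded)
# Only a finite prefix of the unbounded stream can ever be inspected.
first_four = list(islice(streaming_totals(unbounded), 4))
```

Calling `batch_total(unbounded)` would never return, which is the practical reason streaming jobs compute over windows rather than over "all the data".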
Event time vs. processing time
This distinction is the heart of stream processing, and the hardest one for newcomers to grasp:
- Event time — when the event actually happened (the timestamp in the data).
- Processing time — when an operator in your system actually processes the event (the wall clock of the machine doing the work).
- Ingestion time — when it entered the pipeline.
These differ because of network delays, retries, and offline devices. The phrase you’ll hear constantly:
“We aggregate by event time, not processing time, because mobile clients buffer events offline and send them late.”
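To see why the choice of timestamp matters, here is a small sketch that counts the same three events per minute, once by event time and once by processing time. The `Event` class, its field names, and the timestamps are made up for illustration; one event is assumed to have been buffered offline and processed minutes later.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    user: str
    event_time: int       # seconds, stamped by the device when it happened
    processing_time: int  # seconds, stamped when the operator handled it

def count_per_minute(events, time_field):
    """Count events per one-minute bucket, using the chosen timestamp."""
    counts = Counter()
    for e in events:
        minute_start = getattr(e, time_field) // 60 * 60
        counts[minute_start] += 1
    return dict(counts)

events = [
    Event("a", event_time=10, processing_time=12),
    Event("b", event_time=50, processing_time=55),
    # Buffered offline: happened in minute 0, processed in minute 3.
    Event("c", event_time=58, processing_time=200),
]

by_event_time = count_per_minute(events, "event_time")
by_processing_time = count_per_minute(events, "processing_time")
```

By event time, all three events land in the first minute; by processing time, the buffered event is attributed to a later minute in which nothing actually happened.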
Windowing
Since a stream is infinite, you can’t “wait for all the data.” Instead you group events into windows — finite chunks to compute over.
- Tumbling window — fixed, non-overlapping (e.g. every 5 minutes).
- Sliding window — overlapping (e.g. last 5 minutes, updated every minute).
- Session window — grouped by activity, closed after a gap of inactivity.
- Window size and slide interval
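The three window types above can be sketched as plain functions that assign timestamps to windows. This is an illustrative simplification, not any engine's API: the function names are hypothetical, windows are half-open `[start, end)` intervals, and the session rule (a gap strictly smaller than `gap` keeps the session open) is an assumption.

```python
def tumbling_window(ts, size):
    """Return the single [start, end) window that contains timestamp ts."""
    start = ts - ts % size
    return (start, start + size)

def sliding_windows(ts, size, slide):
    """Return every [start, end) window of the given size that contains ts."""
    windows = []
    # Latest window start at or before ts, aligned to the slide interval.
    start = ts - ts % slide
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

def session_windows(timestamps, gap):
    """Group timestamps into sessions; a pause of `gap` or more starts a new one."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

one = tumbling_window(430, size=300)
many = sliding_windows(430, size=300, slide=60)
sessions = session_windows([1, 2, 3, 20, 21, 50], gap=5)
```

With a 300-second size, the event at second 430 falls into exactly one tumbling window but into five overlapping sliding windows, which is why sliding windows cost more state.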
Collocations:
- events fall into a window
- a window fires / closes / emits a result
- we window the stream by 5 minutes
- a window triggers computation
“We use a tumbling window of one minute. When the window fires, it emits the count and clears its state.”
Pronunciation: tumbling (TUM-bling), sliding (SLY-ding). Straightforward, but say them confidently.
Watermarks and late data
A watermark is the system’s estimate of “we’ve probably seen all events up to time T.” It lets the engine decide when a window is complete enough to emit.
- Watermark — a moving threshold of event-time completeness.
- Late data / late-arriving events — events that arrive after the watermark has already passed the end of their window.
- Allowed lateness — a grace period for late events.
- Dropped / discarded events — late events past the grace period.
“The watermark advances as event time progresses. Events that arrive after the watermark has passed their window are late; within the allowed lateness we still count them, and beyond it we drop them.”
This is a classic interview question for streaming roles — be able to explain watermarks in two sentences.
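The interaction between watermark, late data, and allowed lateness can be sketched in a few lines. The text does not say how the watermark is computed, so this example assumes one common heuristic: the watermark trails the highest event time seen so far by a fixed `max_delay`. The function name `classify` and all parameter values are hypothetical.

```python
def classify(event_times, window_size, max_delay, allowed_lateness):
    """Label each event 'on_time', 'late_accepted' or 'dropped'.

    event_times: event-time timestamps, listed in arrival order.
    Assumes tumbling windows and a watermark of (max event time seen - max_delay).
    """
    max_seen = float("-inf")
    labels = []
    for ts in event_times:
        max_seen = max(max_seen, ts)
        watermark = max_seen - max_delay
        window_end = ts - ts % window_size + window_size
        if watermark < window_end:
            labels.append((ts, "on_time"))        # window has not fired yet
        elif watermark < window_end + allowed_lateness:
            labels.append((ts, "late_accepted"))  # fired, but within the grace period
        else:
            labels.append((ts, "dropped"))        # past the grace period
    return labels

labels = classify(
    [5, 50, 75, 58, 130, 40],
    window_size=60, max_delay=10, allowed_lateness=30,
)
```

Here the event at second 58 arrives after the watermark has reached 65, so its window `[0, 60)` has already fired, but it is still inside the 30-second grace period. The event at second 40 arrives when the watermark is at 120 and is dropped.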
Backpressure
Backpressure is the mechanism that kicks in when a downstream operator can’t keep up with an upstream one: the slow operator signals “slow down” upstream, so buffers stay bounded and the system doesn’t run out of memory.
- a slow consumer exerts backpressure / pushes back
- the pipeline applies backpressure to throttle the source
- without it, the system falls behind and lag grows
- we buffer events, then start throttling the producer
“The sink couldn’t keep up, so it exerted backpressure all the way back to the Kafka source, which throttled ingestion. Lag stayed bounded — exactly what we want.”
Pronunciation: one word, backpressure (BACK-presh-er). Don’t say “back pressure” as two stressed words in a tech context.
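The throttling behaviour described in this section can be illustrated with a deterministic simulation: a fast source feeds a slow sink through a bounded buffer, and the source may only emit while the buffer has room. This is a toy model under assumed rates, not how any particular engine implements backpressure; `run_pipeline` and its parameters are hypothetical.

```python
from collections import deque

def run_pipeline(ticks, produce_rate, consume_rate, capacity):
    """Simulate a source feeding a slower sink through a bounded buffer.

    Returns (emitted, consumed, deferred, max_buffered).
    """
    buffer = deque()
    emitted = consumed = deferred = max_buffered = 0
    for _ in range(ticks):
        # Source: emit only while the buffer has room; otherwise hold back.
        for _ in range(produce_rate):
            if len(buffer) < capacity:
                buffer.append(emitted)
                emitted += 1
            else:
                deferred += 1  # throttled emit attempt, not a lost event
        max_buffered = max(max_buffered, len(buffer))
        # Sink: drain at its own, slower rate.
        for _ in range(min(consume_rate, len(buffer))):
            buffer.popleft()
            consumed += 1
    return emitted, consumed, deferred, max_buffered

result = run_pipeline(ticks=10, produce_rate=5, consume_rate=2, capacity=8)
```

Once the buffer fills, the source is throttled down to the sink's rate of two events per tick, and the buffer never holds more than its capacity of eight. Without the capacity check, the buffer would grow by three events every tick for as long as the job runs.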
State and stateful processing
Stream operators often need state — remembering things between events.
- Stateless — each event handled independently (e.g. a filter).
- Stateful — depends on past events (e.g. a running count).
- Keyed state — state partitioned by a key.
- State backend — where state lives (memory, RocksDB).
- Checkpoint — a periodic snapshot of state for recovery.
- Savepoint — a manually triggered snapshot for upgrades.
“The aggregation is stateful: it keeps a running total in keyed state, backed by RocksDB, and checkpoints every 30 seconds so it can recover after a crash.”
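A minimal sketch of keyed state with checkpoint and recovery follows. It keeps state in a plain dictionary rather than a real state backend, and the class `RunningTotal`, its methods, and the sample events are all hypothetical. The key idea it shows is that a checkpoint stores the state together with the input position it reflects, so recovery can replay from exactly that point.

```python
import copy

class RunningTotal:
    """A stateful operator that keeps one running total per key."""

    def __init__(self):
        self.state = {}   # keyed state: key -> running total
        self.offset = 0   # number of input events reflected in the state

    def process(self, key, amount):
        self.state[key] = self.state.get(key, 0) + amount
        self.offset += 1
        return self.state[key]

    def checkpoint(self):
        """Snapshot the state together with the input offset it covers."""
        return {"state": copy.deepcopy(self.state), "offset": self.offset}

    def restore(self, snapshot):
        self.state = copy.deepcopy(snapshot["state"])
        self.offset = snapshot["offset"]

events = [("alice", 10), ("bob", 5), ("alice", 7), ("bob", 1)]

op = RunningTotal()
for key, amount in events[:2]:
    op.process(key, amount)
snapshot = op.checkpoint()
op.process(*events[2])  # processed, but lost in the simulated crash below

# Simulated crash: a fresh operator restores the snapshot and replays
# the input from the checkpointed offset.
recovered = RunningTotal()
recovered.restore(snapshot)
for key, amount in events[recovered.offset:]:
    recovered.process(key, amount)
```

After recovery the totals match what an uninterrupted run would have produced, because the event processed after the checkpoint is replayed rather than counted twice.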
Delivery semantics
These are the same three guarantees used in messaging systems, but they matter even more here because operators hold state:
- At-most-once — may lose events.
- At-least-once — may duplicate.
- Exactly-once — each event affects state once (the gold standard, hard to achieve).
“We rely on checkpoints and idempotent sinks to achieve exactly-once semantics end to end.”
Related: idempotent sinks, transactional writes, replay from the source on recovery.
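Why idempotent sinks matter can be shown with a short sketch. It assumes at-least-once delivery, where a recovery replays some events, and compares a sink that appends every delivery with one that upserts by event ID. The class names, event IDs, and values are hypothetical.

```python
class AppendSink:
    """Non-idempotent: every delivery is recorded, duplicates included."""

    def __init__(self):
        self.rows = []

    def write(self, event_id, value):
        self.rows.append(value)

    def total(self):
        return sum(self.rows)

class IdempotentSink:
    """Idempotent: writes are upserts keyed by event_id, so redelivery is harmless."""

    def __init__(self):
        self.rows = {}

    def write(self, event_id, value):
        self.rows[event_id] = value

    def total(self):
        return sum(self.rows.values())

events = [("e1", 10), ("e2", 20), ("e3", 30)]
# At-least-once delivery: after a failure the source replays from the last
# checkpoint, so e2 and e3 are delivered a second time.
deliveries = events + events[1:]

append_sink, idempotent_sink = AppendSink(), IdempotentSink()
for event_id, value in deliveries:
    append_sink.write(event_id, value)
    idempotent_sink.write(event_id, value)
```

The appending sink double-counts the replayed events, while the idempotent sink ends up with the correct total. This is the sense in which replay plus an idempotent sink yields exactly-once results even though individual events are delivered more than once.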
When things go wrong
- Lag — how far behind real time the pipeline is. “Consumer lag is climbing.”
- Skew — uneven data distribution overloading one partition (data skew, hot key).
- Stragglers — slow tasks holding up the rest.
- Spill to disk — state too big for memory.
“A hot key caused data skew — one operator did 80% of the work while others idled. We re-keyed to rebalance.”
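One common way to re-key is salting: appending a small suffix to the hot key so its events spread over several partitions. The sketch below is illustrative only. The byte-sum partitioner is a toy chosen so the result is reproducible (real systems use a proper hash function), and the key names, counts, and partition count are made up.

```python
from collections import Counter
from itertools import count

NUM_PARTITIONS = 4

def partition_for(key):
    """Toy stable partitioner: byte sum modulo the partition count."""
    return sum(key.encode()) % NUM_PARTITIONS

def load_per_partition(keys):
    """Return the number of events routed to each partition, by index."""
    loads = Counter(partition_for(k) for k in keys)
    return [loads[p] for p in range(NUM_PARTITIONS)]

def salt_hot_key(keys, hot_key, salts):
    """Re-key: spread one hot key over `salts` sub-keys, round-robin."""
    counter = count()
    return [f"{k}#{next(counter) % salts}" if k == hot_key else k for k in keys]

# 80 events for one hot key, 5 events each for four ordinary keys.
keys = ["hot"] * 80 + ["a", "b", "c", "d"] * 5

skewed = load_per_partition(keys)
rebalanced = load_per_partition(salt_hot_key(keys, "hot", salts=4))
```

Before salting, one partition receives 85 of the 100 events; afterwards each partition receives 25. The trade-off is that the salted sub-keys produce partial results, so a second aggregation step is needed to merge them back into one total per original key.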
Before / after: sounding fluent
Before: “Sometimes events come late and the count is wrong, and when the next step is slow everything fills up the memory.”
After: “Late-arriving events past the allowed lateness were dropped, so the count came out too low. And when the sink slowed down, the lack of backpressure let buffers grow until memory blew up.”
Quick glossary
| Term | One-line meaning |
|---|---|
| Unbounded | Infinite, never-ending stream |
| Event time | When the event actually happened |
| Window | A finite chunk of the stream to compute over |
| Watermark | Estimate of event-time completeness |
| Late data | Events arriving after the watermark has passed their window |
| Backpressure | Downstream signalling “slow down” |
| Checkpoint | Snapshot of state for recovery |
| Exactly-once | Each event affects state once |
| Skew / hot key | Uneven load on one partition |
Key takeaways
- A stream is unbounded — you compute over windows, not the whole dataset.
- Distinguish event time from processing time; aggregate by event time for correctness.
- Watermarks decide when a window is complete; events that arrive after the watermark has passed their window are late and may be dropped.
- Backpressure keeps a fast producer from overwhelming a slow consumer — learn the collocation “exert backpressure.”
- Stateful pipelines rely on checkpoints for recovery and idempotent sinks for exactly-once semantics.