Practice stream processing troubleshooting vocabulary: checkpoint failures, watermarks, operator state, parallelism scaling, and stalled streaming jobs.
1 / 5
An engineer says 'the checkpoint is failing — the job will restart'. What is a checkpoint in stream processing?
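In frameworks like Apache Flink, a checkpoint is a periodic, consistent snapshot of all operator state together with the source read positions (for example, Kafka offsets). If the job fails, it restarts from the last successful checkpoint rather than from the beginning, which is what makes at-least-once or exactly-once guarantees possible. A failed checkpoint does not lose data by itself: the previous successful checkpoint remains the recovery point. Depending on configuration, however, the job is failed and restarted once the number of failed checkpoints exceeds the tolerated limit, and the longer checkpoints keep failing, the more data must be reprocessed after a restart.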
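The recovery behaviour can be illustrated with a toy model. This is a minimal sketch, not Flink's API: `Checkpoint`, `run`, `checkpoint_every`, and `fail_at` are hypothetical names, the "operator" is a running sum, and the input list stands in for a replayable source.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    offset: int  # next input position to read (hypothetical source offset)
    total: int   # operator state: the running sum at that offset


def run(events, checkpoint_every, fail_at=None, start=None):
    """Sum events, taking a snapshot every `checkpoint_every` records.

    Returns (total, last_successful_checkpoint, failed).
    """
    cp = start if start is not None else Checkpoint(offset=0, total=0)
    offset, total = cp.offset, cp.total
    while offset < len(events):
        if fail_at is not None and offset == fail_at:
            # Simulated crash: state built since the last snapshot is lost.
            return total, cp, True
        total += events[offset]
        offset += 1
        if offset % checkpoint_every == 0:
            cp = Checkpoint(offset, total)
    return total, cp, False


events = [1, 2, 3, 4, 5, 6, 7]

# First attempt crashes before reading the record at index 5.
_, last_cp, failed = run(events, checkpoint_every=3, fail_at=5)

# Restart from the last successful checkpoint instead of from the beginning.
recovered_total, _, _ = run(events, checkpoint_every=3, start=last_cp)
```

In this sketch the restart replays the records at indexes 3 and 4, which had already been processed before the crash. Because the state is rolled back to the same snapshot, they are counted only once in the final sum.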
2 / 5
'Late data is being dropped after the watermark.' What is a watermark in stream processing?
A watermark is a mechanism for handling out-of-order data in event-time windowed operations. It marks a point in event time up to which the system assumes all data has arrived. An event whose timestamp is older than the current watermark is classified as late; depending on the configured late-data policy, it is dropped, routed to a separate output, or still accepted within an allowed-lateness period.
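One common way to generate watermarks is to trail the highest timestamp seen so far by a fixed delay. The sketch below assumes that strategy and simplifies it: `classify` and `max_delay` are hypothetical names, timestamps are plain integers, and the watermark is recomputed on every event rather than emitted periodically.

```python
def classify(timestamps, max_delay):
    """Return (timestamp, is_late) pairs for events in arrival order.

    The watermark trails the highest timestamp seen so far by `max_delay`.
    An event is late if its timestamp is older than the watermark that was
    in effect when it arrived.
    """
    max_seen = float("-inf")
    results = []
    for ts in timestamps:
        watermark = max_seen - max_delay  # watermark before this event
        results.append((ts, ts < watermark))
        max_seen = max(max_seen, ts)
    return results


# Event timestamps in the order they arrive (out of order on purpose).
arrivals = [10, 12, 11, 20, 14, 19]
result = classify(arrivals, max_delay=5)
late = [ts for ts, is_late in result if is_late]
```

Here the event with timestamp 11 arrives out of order but within the delay, so it is still on time. The event with timestamp 14 arrives after 20 has pushed the watermark to 15, so it is the only one classified as late.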
3 / 5
'The operator state is too large.' What problem does this cause in a streaming job?
Operator state stores intermediate computation results (e.g., aggregations, join buffers). Very large state causes slow checkpoints (potentially timing out), high memory pressure, and long recovery times after a failure. Remedies include bounding state size with a TTL (time-to-live), switching to incremental checkpoints, or moving state off the JVM heap into an embedded, disk-backed state backend such as RocksDB.
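The TTL idea can be sketched as a keyed store that forgets entries which have not been updated recently. This is an illustration only, not a real state backend: `TtlState` and its methods are hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```python
class TtlState:
    """Keyed state whose entries expire `ttl` time units after their last update."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._entries = {}  # key -> (value, last_update_time)

    def put(self, key, value, now):
        self._entries[key] = (value, now)

    def get(self, key, now):
        entry = self._entries.get(key)
        if entry is None or now - entry[1] >= self.ttl:
            # Expired entries are removed lazily on read.
            self._entries.pop(key, None)
            return None
        return entry[0]

    def cleanup(self, now):
        """Remove all expired entries and return how many were dropped."""
        expired = [k for k, (_, t) in self._entries.items() if now - t >= self.ttl]
        for k in expired:
            del self._entries[k]
        return len(expired)

    def __len__(self):
        return len(self._entries)


state = TtlState(ttl=60)
state.put("user-a", 3, now=0)
state.put("user-b", 7, now=50)

removed = state.cleanup(now=70)       # user-a is 70 units old, user-b only 20
remaining = len(state)
still_there = state.get("user-b", now=100)
gone = state.get("user-b", now=110)   # exactly 60 units old, so expired
```

Without a TTL, a key that is seen once and never again stays in state forever, which is a typical way for state to grow without bound.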
4 / 5
'We increased the parallelism to handle the backlog.' What does increasing parallelism do in a streaming job?
Parallelism in frameworks like Flink or Spark Streaming is the number of task instances that run an operator concurrently. Each instance handles a subset of the input partitions, so higher parallelism lets more data be processed at the same time. Increasing parallelism helps drain a backlog, but it requires sufficient cluster resources, and with a Kafka source it only helps up to the number of partitions: source instances beyond that count have nothing to read.
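The partition limit is easy to see with a small sketch. It assumes a simple round-robin assignment of partitions to source instances, which is an illustrative choice and not necessarily how any particular connector assigns them; `assign_partitions` and `idle_subtasks` are hypothetical helpers.

```python
def assign_partitions(num_partitions, parallelism):
    """Assign partition ids to subtask ids round-robin (assumed strategy)."""
    assignment = {subtask: [] for subtask in range(parallelism)}
    for partition in range(num_partitions):
        assignment[partition % parallelism].append(partition)
    return assignment


def idle_subtasks(assignment):
    """Return the subtasks that received no partitions."""
    return [subtask for subtask, parts in assignment.items() if not parts]


# 6 partitions, parallelism 4: every subtask has work, but unevenly.
balanced = assign_partitions(num_partitions=6, parallelism=4)

# 6 partitions, parallelism 8: two subtasks have nothing to read.
oversized = assign_partitions(num_partitions=6, parallelism=8)
idle = idle_subtasks(oversized)
```

With 6 partitions, raising parallelism from 4 to 6 spreads the load evenly, while going beyond 6 only adds idle source instances unless the topic is repartitioned.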
5 / 5
A team says 'the streaming job is stalled'. What typically causes a streaming job to stall?
A stalled streaming job has stopped making forward progress even though data is available. Common causes include backpressure from a slow or blocked downstream operator, a checkpoint that won't complete (which can block the pipeline), a deadlock in state access, or resource exhaustion (OOM, disk full). Checking the backpressure graphs and checkpoint durations is the first diagnostic step.
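The definition above (no forward progress while data is available) can be turned into a simple check over periodic metric samples. This is a hedged sketch of one possible heuristic: the sample shape `(records_out_total, source_lag)`, the function `is_stalled`, and the `min_intervals` threshold are all assumptions, not metrics or APIs of a specific framework.

```python
def is_stalled(samples, min_intervals=3):
    """Decide whether a job looks stalled from time-ordered metric samples.

    Each sample is (records_out_total, source_lag). The job is considered
    stalled if, over the last `min_intervals` intervals, the output counter
    never advanced while the source still had unread data (lag > 0).
    """
    if len(samples) < min_intervals + 1:
        return False  # not enough history to decide
    recent = samples[-(min_intervals + 1):]
    for (prev_out, _), (cur_out, cur_lag) in zip(recent, recent[1:]):
        if cur_out > prev_out or cur_lag == 0:
            return False  # made progress, or simply had nothing to read
    return True


healthy = [(100, 50), (200, 40), (300, 30), (400, 20)]
stuck = [(100, 50), (400, 80), (400, 120), (400, 160), (400, 200)]
idle = [(400, 0), (400, 0), (400, 0), (400, 0)]

healthy_result = is_stalled(healthy)
stuck_result = is_stalled(stuck)
idle_result = is_stalled(idle)
```

The `idle` case shows why lag matters: a job with a flat output counter and no pending input is waiting for data, not stalled.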