Reading Data Flow Descriptions
5 exercises on reading data pipeline architecture documents: ingestion, deduplication windows, enrichment, dead-letter queues, at-least-once delivery guarantees, and columnar data warehouses.
Reading data pipeline documents
- Identify the source (where data enters), transformations (what happens to it), and sink (where it lands)
- Ingestion layer → receives raw data, writes to a queue/topic for buffering
- Stream processing → real-time transformations: deduplicate, enrich, normalise
- Batch loader → periodic bulk transfer to the data warehouse
- At-least-once → no losses, but duplicates possible; exactly-once → no losses, no duplicates
0 / 5 completed
1 / 5
Read this data pipeline description and answer the question:
Analytics Ingestion Pipeline
Raw clickstream events are produced by client applications (web and mobile) and sent to an ingestion layer consisting of an HTTP endpoint fronted by a load balancer. The endpoint validates the event schema and writes each event to a Kafka topic (
A stream processing layer (Apache Flink jobs) consumes the
A batch loader runs every 15 minutes. It reads all records from
What is the purpose of the 30-second sliding window in the Flink deduplication step?Raw clickstream events are produced by client applications (web and mobile) and sent to an ingestion layer consisting of an HTTP endpoint fronted by a load balancer. The endpoint validates the event schema and writes each event to a Kafka topic (
raw-events) with minimal latency — the goal is never to block the client.A stream processing layer (Apache Flink jobs) consumes the
raw-events topic. Each Flink job performs: (1) deduplication using a 30-second sliding window keyed by event_id, (2) enrichment by joining with a user dimension table from Redis, and (3) normalisation of timestamp fields to UTC. The processed events are written to the enriched-events topic.A batch loader runs every 15 minutes. It reads all records from
enriched-events accumulated since the last run and bulk-loads them into a columnar data warehouse (BigQuery). Analysts query the warehouse using SQL dashboards.The sliding window detects duplicate events arriving within 30 seconds by tracking seen
In real-world event streaming, duplicates are common:
event_id values.In real-world event streaming, duplicates are common:
- A mobile app retries a failed HTTP request — the server receives the same event twice
- Network issues cause the Kafka producer to retry — the same event is written twice
- Client-side bugs can fire the same event multiple times
- Flink keeps a state store of
event_idvalues seen in the last 30 seconds - When a new event arrives, Flink checks if its
event_idis already in the state store - If yes → duplicate, discard
- If no → unique, process and add the ID to the state store
- The window "slides" — IDs older than 30 seconds are evicted from the state store to prevent unbounded memory growth
- ingestion → receiving and accepting data into the system
- deduplication → removing duplicate records from a stream
- sliding window → a time window that moves continuously, tracking recent events
- keyed by → grouped/partitioned by a specific field for per-key state