Advanced Reading #data-pipeline #kafka #stream-processing #data-engineering

Reading Data Flow Descriptions

5 exercises on reading data pipeline architecture documents: ingestion, deduplication windows, enrichment, dead-letter queues, at-least-once delivery guarantees, and columnar data warehouses.

Reading data pipeline documents
  • Identify the source (where data enters), transformations (what happens to it), and sink (where it lands)
  • Ingestion layer → receives raw data, writes to a queue/topic for buffering
  • Stream processing → real-time transformations: deduplicate, enrich, normalise
  • Batch loader → periodic bulk transfer to the data warehouse
  • At-least-once → no losses, but duplicates possible; exactly-once → no losses, no duplicates
0 / 5 completed
1 / 5

Read this data pipeline description and answer the question:

Analytics Ingestion Pipeline

Raw clickstream events are produced by client applications (web and mobile) and sent to an ingestion layer consisting of an HTTP endpoint fronted by a load balancer. The endpoint validates the event schema and writes each event to a Kafka topic (raw-events) with minimal latency — the goal is never to block the client.

A stream processing layer (Apache Flink jobs) consumes the raw-events topic. Each Flink job performs: (1) deduplication using a 30-second sliding window keyed by event_id, (2) enrichment by joining with a user dimension table from Redis, and (3) normalisation of timestamp fields to UTC. The processed events are written to the enriched-events topic.

A batch loader runs every 15 minutes. It reads all records from enriched-events accumulated since the last run and bulk-loads them into a columnar data warehouse (BigQuery). Analysts query the warehouse using SQL dashboards.
What is the purpose of the 30-second sliding window in the Flink deduplication step?