Reading Data Flow Descriptions

Five exercises on reading data pipeline architecture documents, covering ingestion, deduplication windows, enrichment, dead-letter queues, at-least-once delivery guarantees, and columnar data warehouses.

Reading data pipeline documents

Identify the source (where data enters), transformations (what happens to it), and sink (where it lands)
Ingestion layer → receives raw data, writes to a queue/topic for buffering
Stream processing → real-time transformations: deduplicate, enrich, normalise
Batch loader → periodic bulk transfer to the data warehouse
At-least-once → no losses, but duplicates possible; exactly-once → no losses, no duplicates

1 / 5

Read this data pipeline description and answer the question:

Analytics Ingestion Pipeline

Raw clickstream events are produced by client applications (web and mobile) and sent to an ingestion layer consisting of an HTTP endpoint fronted by a load balancer. The endpoint validates the event schema and writes each event to a Kafka topic (raw-events) with minimal latency — the goal is never to block the client.

A stream processing layer (Apache Flink jobs) consumes the raw-events topic. Each Flink job performs: (1) deduplication using a 30-second sliding window keyed by event_id, (2) enrichment by joining with a user dimension table from Redis, and (3) normalisation of timestamp fields to UTC. The processed events are written to the enriched-events topic.

A batch loader runs every 15 minutes. It reads all records from enriched-events accumulated since the last run and bulk-loads them into a columnar data warehouse (BigQuery). Analysts query the warehouse using SQL dashboards.
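To make the deduplication step in the description above more concrete, here is a minimal sketch in plain Python. It is not Flink code and does not use any Flink API. It assumes events arrive as (arrival time in seconds, event_id) pairs in arrival order, and it simplifies the sliding window to "remember each event_id for 30 seconds after it is first seen". The function and variable names are illustrative only.

```python
from typing import Dict, Iterable, List, Tuple

WINDOW_SECONDS = 30  # window length taken from the pipeline description


def deduplicate(events: Iterable[Tuple[float, str]]) -> List[Tuple[float, str]]:
    """Drop events whose event_id was already seen within the last 30 seconds.

    `events` yields (arrival_time_seconds, event_id) pairs in arrival order.
    """
    first_seen: Dict[str, float] = {}
    kept: List[Tuple[float, str]] = []
    for ts, event_id in events:
        # Forget ids that have fallen out of the window so state stays bounded.
        expired = [eid for eid, seen in first_seen.items() if ts - seen > WINDOW_SECONDS]
        for eid in expired:
            del first_seen[eid]
        if event_id in first_seen:
            continue  # same id seen inside the window: treat as a duplicate
        first_seen[event_id] = ts
        kept.append((ts, event_id))
    return kept


# "a" at t=5 is dropped as a retry; "a" at t=45 is outside the window and kept.
print(deduplicate([(0, "a"), (5, "a"), (10, "b"), (45, "a")]))
```

The bounded window is the design trade-off to notice: state never grows without limit, but a duplicate that arrives more than 30 seconds after the original is not caught at this stage.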

What is the purpose of the 30-second sliding window in the Flink deduplication step?

2 / 5

Re-read the Analytics Ingestion Pipeline description and answer the question:

A stream processing layer (Apache Flink jobs) consumes the raw-events topic. Each Flink job performs: (1) deduplication, (2) enrichment by joining with a user dimension table from Redis, and (3) normalisation of timestamp fields to UTC.

A batch loader runs every 15 minutes. It reads all records from enriched-events accumulated since the last run and bulk-loads them into a columnar data warehouse (BigQuery). Analysts query the warehouse using SQL dashboards.
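The enrichment step in this excerpt can be pictured as a key lookup followed by a merge. The sketch below uses an in-memory dictionary as a stand-in for the Redis-backed user dimension table; the field names (user_id, country, plan) and the user_ prefix are assumptions made for illustration, not details from the description.

```python
from typing import Dict

# Hypothetical user dimension. In the pipeline this lookup would go to Redis.
USER_DIMENSION: Dict[str, Dict[str, str]] = {
    "u1": {"country": "GB", "plan": "pro"},
}


def enrich(event: dict, dimension: Dict[str, Dict[str, str]] = USER_DIMENSION) -> dict:
    """Return a copy of the event with user attributes attached.

    Events for unknown users pass through unchanged.
    """
    attrs = dimension.get(event.get("user_id"), {})
    prefixed = {f"user_{key}": value for key, value in attrs.items()}
    return {**event, **prefixed}


print(enrich({"event_id": "e1", "user_id": "u1"}))
```

Each event triggers one lookup, so the speed of that lookup bounds the throughput of the whole stream job.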

What is "enrichment" in the context of this pipeline, and why is the user dimension stored in Redis rather than queried from the main database?

3 / 5

Read this description of pipeline failure handling and answer the question:

Failure handling and replay

Kafka topics are configured with a retention period of 7 days. This means that even if the Flink stream processing layer is unavailable for hours, no data is lost, provided the outage is shorter than the retention period. When Flink recovers, it resumes reading from the last committed offset and processes the backlog.

The batch loader similarly uses consumer group offsets: each run commits its offset after a successful BigQuery load. If a batch fails mid-load, the next run re-reads from the last committed offset, ensuring no events are skipped. BigQuery's MERGE statements handle any duplicates introduced by a partial re-run.

This combination of Kafka retention and offset management provides an at-least-once delivery guarantee: every event will reach BigQuery at least once, though duplicates are possible and must be deduplicated downstream.
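The commit-after-success behaviour described above can be simulated in a few lines. This is a toy model under stated assumptions: a Python list stands in for the Kafka topic, an integer for the committed offset, and the hypothetical FlakyWarehouse class for a warehouse load that fails partway through exactly once. merge_by_event_id only mimics the effect of a MERGE keyed on event_id; it is not BigQuery code.

```python
from typing import Dict, List


class FlakyWarehouse:
    """Stand-in warehouse: appends rows and fails once, partway through a load."""

    def __init__(self, fail_after: int) -> None:
        self.rows: List[dict] = []
        self._fail_after = fail_after
        self._has_failed = False

    def load(self, batch: List[dict]) -> None:
        for index, row in enumerate(batch):
            if not self._has_failed and index == self._fail_after:
                self._has_failed = True
                raise RuntimeError("load interrupted")
            self.rows.append(row)


def run_batch(log: List[dict], committed_offset: int, warehouse: FlakyWarehouse) -> int:
    """Load everything after the committed offset and return the new offset."""
    warehouse.load(log[committed_offset:])  # may raise before anything is committed
    return len(log)  # the offset advances only after a successful load


def merge_by_event_id(rows: List[dict]) -> List[dict]:
    """Keep one row per event_id, mimicking an idempotent MERGE."""
    unique: Dict[str, dict] = {}
    for row in rows:
        unique[row["event_id"]] = row
    return list(unique.values())


log = [{"event_id": f"e{i}"} for i in range(4)]
warehouse = FlakyWarehouse(fail_after=2)
offset = 0
try:
    offset = run_batch(log, offset, warehouse)
except RuntimeError:
    pass  # offset is still 0, so the next run re-reads the same records
offset = run_batch(log, offset, warehouse)

print(len(warehouse.rows))                     # 6: e0 and e1 were loaded twice
print(len(merge_by_event_id(warehouse.rows)))  # 4: one row per event after merging
```

No event is skipped because the offset never moves ahead of a successful load; the price is that the two rows written before the failure appear twice until they are merged.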

What does "at-least-once delivery" mean in this context?

4 / 5

Read this data transformation description and answer the question:

Transformation steps in the Flink job

After deduplication, each event passes through a series of transformation steps:

1. Schema validation: Events that do not conform to the registered Avro schema are routed to a dead-letter topic (raw-events-dlq) for manual inspection. Valid events continue.

2. Field normalisation: Timestamps are converted to UTC. Currency amounts are converted to minor units (pence, cents). Country codes are normalised to ISO 3166-1 alpha-2.

3. PII masking: Any field tagged as personally identifiable information (email, IP address, device fingerprint) is hashed using SHA-256 before the event is written to enriched-events. The raw values are never written downstream.
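The three steps above can be sketched as one function. Several simplifications are assumed here: a required-field check stands in for Avro schema validation, a Python list stands in for the raw-events-dlq topic, timestamps are assumed to be ISO 8601 strings that carry a UTC offset, amounts are assumed to be decimal strings in a two-decimal currency, and only an email field is tagged as PII. Country code normalisation is omitted for brevity. All field names are hypothetical.

```python
import hashlib
from datetime import datetime, timezone
from decimal import Decimal
from typing import List, Optional

REQUIRED_FIELDS = {"event_id", "timestamp", "amount", "email"}  # hypothetical schema
PII_FIELDS = {"email"}  # hypothetical PII tagging

dead_letter: List[dict] = []  # stand-in for the raw-events-dlq topic


def transform(event: dict) -> Optional[dict]:
    """Validate, normalise, and mask one event; return None if it is dead-lettered."""
    # 1. Schema validation: invalid events are set aside, not dropped.
    if not REQUIRED_FIELDS.issubset(event):
        dead_letter.append(event)
        return None

    out = dict(event)

    # 2. Field normalisation: timestamp to UTC, amount to minor units.
    parsed = datetime.fromisoformat(event["timestamp"])
    out["timestamp"] = parsed.astimezone(timezone.utc).isoformat()
    out["amount_minor"] = int(Decimal(event["amount"]) * 100)
    del out["amount"]

    # 3. PII masking: replace the raw value with its SHA-256 hex digest.
    for field in PII_FIELDS:
        out[field] = hashlib.sha256(event[field].encode("utf-8")).hexdigest()

    return out


valid = transform({
    "event_id": "e1",
    "timestamp": "2024-05-01T12:00:00+02:00",
    "amount": "12.34",
    "email": "a@example.com",
})
print(valid["timestamp"], valid["amount_minor"])  # 2024-05-01T10:00:00+00:00 1234
print(transform({"event_id": "e2"}), len(dead_letter))  # None 1
```

Routing the malformed event to a separate list rather than raising an exception mirrors the idea in step 1: one bad record should not stop the stream, but it should remain available for inspection.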

What is the purpose of the dead-letter topic (raw-events-dlq)?

5 / 5

Read this sink and serving layer description and answer the question:

Data sink: BigQuery data warehouse

BigQuery is a columnar data warehouse optimised for analytical queries over large datasets. Unlike row-oriented databases (such as PostgreSQL), BigQuery stores data column-by-column. This means that a query summing the revenue column across 500 million rows only reads the revenue column from disk — it does not read order IDs, user IDs, or product names.
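The difference between the two layouts can be illustrated by counting how many stored values a revenue sum has to touch. This is a toy model in plain Python, not a description of BigQuery's actual storage format; the three sample rows and the use of integer minor units for revenue are assumptions made for the example.

```python
from typing import Dict, List, Tuple

rows = [
    {"order_id": 1, "user_id": 10, "product": "book", "revenue": 1250},
    {"order_id": 2, "user_id": 11, "product": "pen", "revenue": 300},
    {"order_id": 3, "user_id": 10, "product": "lamp", "revenue": 4500},
]

# Columnar layout: one list per column instead of one record per row.
columns: Dict[str, List] = {name: [row[name] for row in rows] for name in rows[0]}


def sum_revenue_row_store(table: List[dict]) -> Tuple[int, int]:
    """Return (total revenue, values read) when whole rows must be read."""
    total = 0
    values_read = 0
    for row in table:
        values_read += len(row)  # every field of the row comes off disk together
        total += row["revenue"]
    return total, values_read


def sum_revenue_column_store(table: Dict[str, List]) -> Tuple[int, int]:
    """Return (total revenue, values read) when only one column is read."""
    revenue = table["revenue"]
    return sum(revenue), len(revenue)


print(sum_revenue_row_store(rows))        # (6050, 12)
print(sum_revenue_column_store(columns))  # (6050, 3)
```

Both functions return the same total, but the columnar version touches a quarter of the values here, and the gap widens as tables gain more columns.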

Analysts run queries on two sets of tables:
• raw tables — a direct copy of enriched-events, partitioned by day
• aggregated tables — pre-computed summaries (daily active users, revenue by country) refreshed hourly by dbt transformation jobs

The aggregated tables serve the business dashboards. The raw tables are used for ad-hoc investigation and building new aggregations.
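As a rough illustration of what a pre-computed summary contains, the sketch below derives daily active users from raw event rows. In the pipeline this work is done by dbt transformation jobs in the warehouse, not by Python; the day and user_id field names are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set


def daily_active_users(events: List[dict]) -> Dict[str, int]:
    """Count distinct user_ids per day from raw event rows."""
    users_by_day: Dict[str, Set[str]] = defaultdict(set)
    for event in events:
        users_by_day[event["day"]].add(event["user_id"])
    return {day: len(users) for day, users in sorted(users_by_day.items())}


raw_events = [
    {"day": "2024-05-01", "user_id": "u1"},
    {"day": "2024-05-01", "user_id": "u1"},
    {"day": "2024-05-01", "user_id": "u2"},
    {"day": "2024-05-02", "user_id": "u1"},
]
print(daily_active_users(raw_events))  # {'2024-05-01': 2, '2024-05-02': 1}
```

A dashboard that reads the small summary avoids rescanning every raw event on each page load, which is why the aggregated tables, not the raw ones, back the business dashboards.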

Why does BigQuery's columnar storage make analytical queries faster than a row-oriented database?