Practice reading and describing data pipeline metrics — events per second, ingestion rates, consumer lag, job SLAs. 5 exercises.
Exercise 1 of 5
A data engineer describes their pipeline's capacity to a stakeholder.
"The pipeline currently processes 50,000 events per second at peak load, ingests approximately 3 TB of raw data daily, and has an end-to-end processing latency of under 4 minutes from event ingestion to the data warehouse."
Which of the following is the most concise and accurate paraphrase of this statement?
Option B is correct because it:
• Translates "50,000 events per second" into plain speech: "50 thousand events every second"
• Explains "ingests 3 TB daily" in natural phrasing: "store about 3 terabytes of raw data per day"
• Restates "end-to-end processing latency of under 4 minutes" as a practical outcome: "data is available in the warehouse within 4 minutes of being collected"
Why option C is weaker: It just lists the numbers without connecting them to meaning. The goal of paraphrasing data pipeline metrics is to help the listener understand the capability and impact.
Key vocabulary:
• events per second (EPS) — the rate at which events (logs, messages, clicks, transactions) enter a system
• ingestion — the process of collecting and importing data into a storage or processing system
• end-to-end latency — time from when data enters the pipeline to when it reaches its final destination
• raw data — unprocessed data before transformation, cleaning, or aggregation
• data warehouse — a central repository storing structured, processed data for analytics

Common IT phrases for pipeline capacity:
• "The pipeline handles X events per second."
• "We process roughly X terabytes of data per day."
• "The end-to-end latency is currently around X minutes."
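As a quick sanity check, the figures in the quote can be combined in a few lines of Python. The peak rate and daily volume come from the exercise; treating the peak rate as constant for a full day is an assumption made here only to get an upper bound:

```python
# Back-of-envelope check of the Exercise 1 figures (values from the quote).
PEAK_EVENTS_PER_SEC = 50_000
DAILY_INGEST_TB = 3

seconds_per_day = 24 * 60 * 60  # 86,400

# Upper bound: assumes the peak rate is sustained all day.
daily_events = PEAK_EVENTS_PER_SEC * seconds_per_day
daily_bytes = DAILY_INGEST_TB * 1e12

# Implied average size per event at that volume.
avg_event_bytes = daily_bytes / daily_events

print(f"Upper-bound daily events: {daily_events:,}")           # 4,320,000,000
print(f"Implied average event size: {avg_event_bytes:.0f} B")  # ~694 B
```

A sub-kilobyte average event size is plausible for clickstream or log data, so the three quoted numbers are mutually consistent.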
Exercise 2 of 5
During a sprint review, a data engineer says:
"Last week our Kafka consumer lag spiked to 2.8 million messages during the Tuesday morning traffic peak. Our consumers were processing 12,000 messages per second but the producers were writing at 18,500 per second — a 54% throughput gap."
What is the engineer describing?
Consumer lag explained: Consumer lag is the difference between where the producer is writing (the latest offset) and where the consumer has read up to (the committed offset). A lag of 2.8 million means 2.8 million messages are waiting to be processed.
Why this matters:
• Producers: 18,500 msg/sec → Consumers: 12,000 msg/sec
• Throughput gap: the engineer's "54%" is the gap relative to consumer throughput: (18,500 − 12,000) ÷ 12,000 ≈ 54%. Measured relative to producer throughput it would be (18,500 − 12,000) ÷ 18,500 ≈ 35%, so it matters which baseline you state.
• At this rate, the backlog grows by 6,500 messages every second = 390,000/min ≈ 23.4M/hour
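The lag arithmetic above can be written out explicitly. This is a sketch using only the rates quoted in the exercise:

```python
# Exercise 2 lag arithmetic (rates from the exercise text).
producer_rate = 18_500  # messages/sec written to the topic
consumer_rate = 12_000  # messages/sec processed by the consumer group

gap = producer_rate - consumer_rate      # backlog growth: 6,500 msg/sec
gap_vs_consumers = gap / consumer_rate   # ≈ 0.54, the engineer's "54% throughput gap"
gap_vs_producers = gap / producer_rate   # ≈ 0.35, the same gap vs the producer baseline

print(f"Backlog grows by {gap:,} msg/sec "
      f"({gap * 60:,}/min, {gap * 3600 / 1e6:.1f}M/hour)")
print(f"Gap relative to consumer throughput: {gap_vs_consumers:.0%}")
```

Stating the baseline ("relative to consumer throughput") avoids the 35%-vs-54% ambiguity when reporting the gap.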
Key vocabulary:
• Kafka — a distributed event streaming platform; widely used for high-throughput data pipelines
• consumer lag — how far behind a Kafka consumer is from the latest message; a key health metric
• producer — the component writing messages to Kafka (or a queue)
• consumer — the component reading and processing messages from Kafka
• offset — the position of a message in a Kafka partition (like a log index)
• throughput gap — the difference between production rate and consumption rate

Useful phrases:
• "Our consumer lag spiked to X million messages during the peak window."
• "Producers are outpacing consumers — we need to scale up the consumer group."
• "We need to provision more consumer instances to close the throughput gap."
Exercise 3 of 5
A data platform team reports their monthly metrics in a business review:
"In April we processed 4.7 billion events — up 23% month-over-month. Average batch job duration was 47 minutes with a p95 of 2 hours 14 minutes. The pipeline SLA of 99.5% data freshness within 1 hour was met on 29 out of 30 days."
Which phrase best describes the pipeline's reliability performance?
Option C is correct because it:
• States the SLA achievement as a rate: 29/30 = 96.7% of days (not just "mostly")
• Names the metric being measured: "1-hour data freshness SLA"
• Combines the reliability result with the growth context that makes it meaningful
• Uses the precise phrase "missed it once" rather than the informal "was late"
Calculating SLA compliance: 29 out of 30 days = 96.7% of days. If the 99.5% figure is read as a monthly target for days in compliance, the pipeline is BELOW it. This nuance is worth flagging explicitly in a review.
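The compliance calculation is one line; a minimal sketch using the numbers from the exercise:

```python
# Daily SLA compliance for Exercise 3 (29 of 30 days, from the exercise text).
days_met, days_total = 29, 30
compliance = days_met / days_total  # ≈ 0.9667

print(f"SLA met on {compliance:.1%} of days")  # 96.7%
# Flag the nuance: daily compliance is below the 99.5% figure
# if that figure is interpreted as a monthly target.
print("Below 99.5% monthly target" if compliance < 0.995 else "Target met")
```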
Key vocabulary:
• data freshness — how recently data was updated; often expressed as a maximum allowed lag (e.g. "data must be under 1 hour old")
• batch job — a process that runs at scheduled intervals, processing data in bulk (vs real-time streaming)
• p95 duration — 95% of batch jobs complete within this time; 5% take longer
• SLA compliance — the degree to which a service meets its agreed-upon targets
• month-over-month (MoM) — comparison between the current month and the previous month

Common data pipeline review phrases:
• "We processed X events this month — up Y% month-over-month."
• "The pipeline met its SLA on X out of Y days."
• "P95 job duration is within our 4-hour budget."
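The "p95 duration" idea can be made concrete with a short sketch. The job durations below are invented for illustration; only the percentile concept comes from the exercise:

```python
# Illustrating "p95 job duration" with hypothetical batch-job run times (minutes).
durations = sorted([35, 40, 42, 44, 45, 46, 47, 48, 50, 52,
                    55, 58, 60, 65, 70, 80, 95, 110, 125, 134])

def percentile(sorted_values, p):
    """Nearest-rank percentile: the value at or below which p% of observations fall."""
    k = max(0, int(round(p / 100 * len(sorted_values))) - 1)
    return sorted_values[k]

# 95% of jobs finish within this time; the slowest 5% take longer.
print(f"p95 duration: {percentile(durations, 95)} min")
```

Note how the p95 (here 125 min) sits far above the median: a handful of slow outlier jobs dominate the tail, which is exactly why reviews report p95 alongside the average.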
Exercise 4 of 5
An engineer writes the following in a design doc:
"The proposed pipeline must handle peak write throughput of 85,000 records/second, support horizontal scaling to 500 GB/hour ingestion, and maintain end-to-end latency under 500ms for 99% of events."
Which set of numbers correctly paraphrases this capacity requirement?
Option A is correct because it:
• States the write-rate unit correctly: "per second", not "per minute"
• Converts 500 GB/hour to a daily figure for context: 500 × 24 = 12,000 GB ≈ 12 TB/day
• Explains "end-to-end latency under 500ms for 99% of events" as "99% of events processed within half a second" — clearer for a non-technical stakeholder
Why option D is insufficient: Option D just repeats the abbreviations without explaining them. Design docs should be precise, but verbal summaries and stakeholder communications need translation.
Key vocabulary:
• write throughput — the rate at which new data is written to a system
• horizontal scaling — adding more machines to handle load (vs vertical scaling = larger machine)
• ingestion rate — how fast data enters the system; often expressed in GB/hour or records/sec
• end-to-end latency — total time from data creation to final availability in the destination
• p99 latency — "latency under 500ms for 99% of events" = p99 latency ≤ 500ms
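The unit conversions behind Option A can be sketched directly from the design-doc figures:

```python
# Converting the Exercise 4 capacity figures (values from the design-doc quote).
ingest_gb_per_hour = 500
peak_records_per_sec = 85_000

gb_per_day = ingest_gb_per_hour * 24         # 12,000 GB/day
tb_per_day = gb_per_day / 1000               # ~12 TB/day, the stakeholder-friendly figure
records_per_min = peak_records_per_sec * 60  # 5,100,000 records/min

print(f"{ingest_gb_per_hour} GB/hour ≈ {tb_per_day:.0f} TB/day")
print(f"{peak_records_per_sec:,} records/sec = {records_per_min:,} records/min")
```

The per-minute figure shows why getting the unit right matters: confusing "per second" with "per minute" changes the requirement by a factor of 60.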
Exercise 5 of 5
A team's pipeline performance review shows the following for the past month:
| Metric | Target | Actual |
|---|---|---|
| Daily volume | ≥ 2B events/day | 2.4B events/day |
| Data freshness (p99) | ≤ 30 min | 22 min |
| Job success rate | ≥ 99.5% | 97.8% |
| Failed job recovery (MTTR) | ≤ 15 min | 41 min |
Which metrics are not meeting their SLAs?
Reading the table:
• Daily volume: 2.4B ≥ 2B target ✅ Exceeds target
• Data freshness: 22 min ≤ 30 min target ✅ Meets target (with an 8-minute buffer)
• Job success rate: 97.8% < 99.5% target ❌ Below target by 1.7 percentage points
• Failed job recovery (MTTR): 41 min > 15 min target ❌ Almost 3× the target
Business impact: A 97.8% job success rate means 2.2% of jobs fail. If running 1,000 jobs per day, that's 22 failed jobs per day. An MTTR of 41 minutes instead of 15 means each failure causes 26 extra minutes of delay — directly impacting data freshness for downstream consumers.
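The table-reading above can be automated with a small checker. The values come from the table; the comparator per metric encodes whether "meeting the SLA" means being above or below the target:

```python
# Checking each Exercise 5 metric against its target (values from the table).
import operator

# Each entry: (name, actual, target, comparator); the comparator defines
# what "meeting the SLA" means for that metric (higher-is-better vs lower-is-better).
metrics = [
    ("Daily volume (B events/day)",    2.4,  2.0,  operator.ge),
    ("Data freshness p99 (min)",       22,   30,   operator.le),
    ("Job success rate (%)",           97.8, 99.5, operator.ge),
    ("Failed job recovery MTTR (min)", 41,   15,   operator.le),
]

breaches = [name for name, actual, target, ok in metrics if not ok(actual, target)]
print("SLA breaches:", breaches)
```

Encoding the direction of each target is the key design choice: "≥ 2B" and "≤ 15 min" are both SLAs, but a naive "actual < target" check would misread half the table.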
Key vocabulary:
• SLA breach — when an actual metric value falls short of the agreed target
• MTTR (Mean Time To Recovery) — average time to recover from a failed job or service; also used as "Mean Time To Repair"
• job success rate — percentage of scheduled pipeline jobs that complete without errors
• data freshness — how current the data is; described as a maximum allowed age or lag
• downstream consumers — systems, dashboards, or teams that depend on pipeline output
How to report this at a meeting:
• "We missed two SLAs this month: job success rate was 97.8% against a 99.5% target, and MTTR for failed jobs was 41 minutes — almost three times our 15-minute target."
• "We need to focus on pipeline stability — volume and freshness are green, but reliability is lagging."