Practice reading and describing data pipeline metrics — events per second, ingestion rates, consumer lag, job SLAs. 5 exercises.
Exercise 1 of 5
A data engineer describes their pipeline's capacity to a stakeholder.
"The pipeline currently processes 50,000 events per second at peak load, ingests approximately 3 TB of raw data daily, and has an end-to-end processing latency of under 4 minutes from event ingestion to the data warehouse."
Which of the following is the most concise and accurate paraphrase of this statement?
Option B is correct because it:
• Translates "50,000 events per second" into plain speech: "50 thousand events every second"
• Explains "ingests 3 TB daily" in natural phrasing: "store about 3 terabytes of raw data per day"
• Restates "end-to-end processing latency of under 4 minutes" as a practical outcome: "data is available in the warehouse within 4 minutes of being collected"
Why option C is weaker: It just lists the numbers without connecting them to meaning. The goal of paraphrasing data pipeline metrics is to help the listener understand the capability and impact.
Key vocabulary:
• events per second (EPS) — the rate at which events (logs, messages, clicks, transactions) enter a system
• ingestion — the process of collecting and importing data into a storage or processing system
• end-to-end latency — time from when data enters the pipeline to when it reaches its final destination
• raw data — unprocessed data before transformation, cleaning, or aggregation
• data warehouse — a central repository storing structured, processed data for analytics

Common IT phrases for pipeline capacity:
• "The pipeline handles X events per second."
• "We process roughly X terabytes of data per day."
• "The end-to-end latency is currently around X minutes."
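As a quick sanity check, the figures in the quote can be combined in a few lines of Python. The peak rate and daily volume come from the exercise; treating the peak rate as constant for a full day is an assumption made here only to get an upper bound:

```python
# Back-of-envelope check of the Exercise 1 figures (values from the quote).
PEAK_EVENTS_PER_SEC = 50_000
DAILY_INGEST_TB = 3

seconds_per_day = 24 * 60 * 60  # 86,400

# Upper bound: assumes the peak rate is sustained all day.
daily_events = PEAK_EVENTS_PER_SEC * seconds_per_day
daily_bytes = DAILY_INGEST_TB * 1e12

# Implied average size per event at that volume.
avg_event_bytes = daily_bytes / daily_events

print(f"Upper-bound daily events: {daily_events:,}")           # 4,320,000,000
print(f"Implied average event size: {avg_event_bytes:.0f} B")  # ~694 B
```

A sub-kilobyte average event size is plausible for clickstream or log data, so the three quoted numbers are mutually consistent.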
Exercise 2 of 5
During a sprint review, a data engineer says:
"Last week our Kafka consumer lag spiked to 2.8 million messages during the Tuesday morning traffic peak. Our consumers were processing 12,000 messages per second but the producers were writing at 18,500 per second — a 54% throughput gap."
What is the engineer describing?
Consumer lag explained: Consumer lag is the difference between where the producer is writing (the latest offset) and where the consumer has read up to (the committed offset). A lag of 2.8 million means 2.8 million messages are waiting to be processed.
Why this matters:
• Producers: 18,500 msg/sec → Consumers: 12,000 msg/sec
• Throughput gap: the engineer's "54%" is the gap relative to consumer throughput: (18,500 − 12,000) ÷ 12,000 ≈ 54%. Measured relative to producer throughput it would be (18,500 − 12,000) ÷ 18,500 ≈ 35%, so it matters which baseline you state.
• At this rate, the backlog grows by 6,500 messages every second = 390,000/min ≈ 23.4M/hour
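The lag arithmetic above can be written out explicitly. This is a sketch using only the rates quoted in the exercise:

```python
# Exercise 2 lag arithmetic (rates from the exercise text).
producer_rate = 18_500  # messages/sec written to the topic
consumer_rate = 12_000  # messages/sec processed by the consumer group

gap = producer_rate - consumer_rate      # backlog growth: 6,500 msg/sec
gap_vs_consumers = gap / consumer_rate   # ≈ 0.54, the engineer's "54% throughput gap"
gap_vs_producers = gap / producer_rate   # ≈ 0.35, the same gap vs the producer baseline

print(f"Backlog grows by {gap:,} msg/sec "
      f"({gap * 60:,}/min, {gap * 3600 / 1e6:.1f}M/hour)")
print(f"Gap relative to consumer throughput: {gap_vs_consumers:.0%}")
```

Stating the baseline ("relative to consumer throughput") avoids the 35%-vs-54% ambiguity when reporting the gap.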
Key vocabulary:
• Kafka — a distributed event streaming platform; widely used for high-throughput data pipelines
• consumer lag — how far behind a Kafka consumer is from the latest message; a key health metric
• producer — the component writing messages to Kafka (or a queue)
• consumer — the component reading and processing messages from Kafka
• offset — the position of a message in a Kafka partition (like a log index)
• throughput gap — the difference between production rate and consumption rate

Useful phrases:
• "Our consumer lag spiked to X million messages during the peak window."
• "Producers are outpacing consumers — we need to scale up the consumer group."
• "We need to provision more consumer instances to close the throughput gap."
Exercise 3 of 5
A data platform team reports their monthly metrics in a business review:
"In April we processed 4.7 billion events — up 23% month-over-month. Average batch job duration was 47 minutes with a p95 of 2 hours 14 minutes. The pipeline SLA of 99.5% data freshness within 1 hour was met on 29 out of 30 days."
Which phrase best describes the pipeline's reliability performance?
Option C is correct because it:
• States the SLA achievement as a rate: 29/30 = 96.7% of days (not just "mostly")
• Names the metric being measured: "1-hour data freshness SLA"
• Combines the reliability result with the growth context that makes it meaningful
• Uses the precise phrase "missed it once" rather than the informal "was late"
Calculating SLA compliance: 29 out of 30 days = 96.7% of days. If the 99.5% figure is read as a monthly target for days in compliance, the pipeline is BELOW it. This nuance is worth flagging explicitly in a review.
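The compliance calculation is one line; a minimal sketch using the numbers from the exercise:

```python
# Daily SLA compliance for Exercise 3 (29 of 30 days, from the exercise text).
days_met, days_total = 29, 30
compliance = days_met / days_total  # ≈ 0.9667

print(f"SLA met on {compliance:.1%} of days")  # 96.7%
# Flag the nuance: daily compliance is below the 99.5% figure
# if that figure is interpreted as a monthly target.
print("Below 99.5% monthly target" if compliance < 0.995 else "Target met")
```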
Key vocabulary:
• data freshness — how recently data was updated; often expressed as a maximum allowed lag (e.g. "data must be under 1 hour old")
• batch job — a process that runs at scheduled intervals, processing data in bulk (vs real-time streaming)
• p95 duration — 95% of batch jobs complete within this time; 5% take longer
• SLA compliance — the degree to which a service meets its agreed-upon targets
• month-over-month (MoM) — comparison between the current month and the previous month

Common data pipeline review phrases:
• "We processed X events this month — up Y% month-over-month."
• "The pipeline met its SLA on X out of Y days."
• "P95 job duration is within our 4-hour budget."
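The "p95 duration" idea can be made concrete with a short sketch. The job durations below are invented for illustration; only the percentile concept comes from the exercise:

```python
# Illustrating "p95 job duration" with hypothetical batch-job run times (minutes).
durations = sorted([35, 40, 42, 44, 45, 46, 47, 48, 50, 52,
                    55, 58, 60, 65, 70, 80, 95, 110, 125, 134])

def percentile(sorted_values, p):
    """Nearest-rank percentile: the value at or below which p% of observations fall."""
    k = max(0, int(round(p / 100 * len(sorted_values))) - 1)
    return sorted_values[k]

# 95% of jobs finish within this time; the slowest 5% take longer.
print(f"p95 duration: {percentile(durations, 95)} min")
```

Note how the p95 (here 125 min) sits far above the median: a handful of slow outlier jobs dominate the tail, which is exactly why reviews report p95 alongside the average.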
Exercise 4 of 5
An engineer writes the following in a design doc:
"The proposed pipeline must handle peak write throughput of 85,000 records/second, support horizontal scaling to 500 GB/hour ingestion, and maintain end-to-end latency under 500ms for 99% of events."
Which set of numbers correctly paraphrases this capacity requirement?
Option A is correct because it:
• States the write-rate unit correctly: "per second", not "per minute"
• Converts 500 GB/hour to a daily figure for context: 500 × 24 = 12,000 GB ≈ 12 TB/day
• Explains "end-to-end latency under 500ms for 99% of events" as "99% of events processed within half a second" — clearer for a non-technical stakeholder
Why option D is insufficient: Option D just repeats the abbreviations without explaining them. Design docs should be precise, but verbal summaries and stakeholder communications need translation.
Key vocabulary:
• write throughput — the rate at which new data is written to a system
• horizontal scaling — adding more machines to handle load (vs vertical scaling = larger machine)
• ingestion rate — how fast data enters the system; often expressed in GB/hour or records/sec
• end-to-end latency — total time from data creation to final availability in the destination
• p99 latency — "latency under 500ms for 99% of events" = p99 latency ≤ 500ms
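The unit conversions behind Option A can be sketched directly from the design-doc figures:

```python
# Converting the Exercise 4 capacity figures (values from the design-doc quote).
ingest_gb_per_hour = 500
peak_records_per_sec = 85_000

gb_per_day = ingest_gb_per_hour * 24         # 12,000 GB/day
tb_per_day = gb_per_day / 1000               # ~12 TB/day, the stakeholder-friendly figure
records_per_min = peak_records_per_sec * 60  # 5,100,000 records/min

print(f"{ingest_gb_per_hour} GB/hour ≈ {tb_per_day:.0f} TB/day")
print(f"{peak_records_per_sec:,} records/sec = {records_per_min:,} records/min")
```

The per-minute figure shows why getting the unit right matters: confusing "per second" with "per minute" changes the requirement by a factor of 60.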
Exercise 5 of 5
A team's pipeline performance review shows the following for the past month:
| Metric | Target | Actual |
|---|---|---|
| Daily volume | ≥ 2B events/day | 2.4B events/day |
| Data freshness (p99) | ≤ 30 min | 22 min |
| Job success rate | ≥ 99.5% | 97.8% |
| Failed job recovery (MTTR) | ≤ 15 min | 41 min |
Which metrics are not meeting their SLAs?
Reading the table:
• Daily volume: 2.4B ≥ 2B target ✅ Exceeds target
• Data freshness: 22 min ≤ 30 min target ✅ Meets target (with an 8-minute buffer)
• Job success rate: 97.8% < 99.5% target ❌ Below target by 1.7 percentage points
• Failed job recovery (MTTR): 41 min > 15 min target ❌ Almost 3× the target
Business impact: A 97.8% job success rate means 2.2% of jobs fail. If running 1,000 jobs per day, that's 22 failed jobs per day. An MTTR of 41 minutes instead of 15 means each failure causes 26 extra minutes of delay — directly impacting data freshness for downstream consumers.
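The table-reading above can be automated with a small checker. The values come from the table; the comparator per metric encodes whether "meeting the SLA" means being above or below the target:

```python
# Checking each Exercise 5 metric against its target (values from the table).
import operator

# Each entry: (name, actual, target, comparator); the comparator defines
# what "meeting the SLA" means for that metric (higher-is-better vs lower-is-better).
metrics = [
    ("Daily volume (B events/day)",    2.4,  2.0,  operator.ge),
    ("Data freshness p99 (min)",       22,   30,   operator.le),
    ("Job success rate (%)",           97.8, 99.5, operator.ge),
    ("Failed job recovery MTTR (min)", 41,   15,   operator.le),
]

breaches = [name for name, actual, target, ok in metrics if not ok(actual, target)]
print("SLA breaches:", breaches)
```

Encoding the direction of each target is the key design choice: "≥ 2B" and "≤ 15 min" are both SLAs, but a naive "actual < target" check would misread half the table.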
Key vocabulary:
• SLA breach — when an actual metric value falls short of the agreed target
• MTTR (Mean Time To Recovery) — average time to recover from a failed job or service; also used as "Mean Time To Repair"
• job success rate — percentage of scheduled pipeline jobs that complete without errors
• data freshness — how current the data is; described as a maximum allowed age or lag
• downstream consumers — systems, dashboards, or teams that depend on pipeline output
How to report this at a meeting:
• "We missed two SLAs this month: job success rate was 97.8% against a 99.5% target, and MTTR for failed jobs was 41 minutes — almost three times our 15-minute target."
• "We need to focus on pipeline stability — volume and freshness are green, but reliability is lagging."