Five exercises: choose the best-structured answer to common Data Platform Engineer interview questions. The focus is on lakehouse architecture, streaming pipelines, and governance.
Structure for data platform design questions
Distinguish batch vs. streaming: latency requirements determine the architecture pattern
Name components precisely: CDC, ETL vs. ELT, lakehouse, data contract, medallion
Cover operational concerns: SLA, data quality, lineage, access control
Address governance: domain ownership, catalogue, discovery, data contracts
1 / 5
The interviewer asks: "Design a data lakehouse for a company with 100 TB of data, mixed batch and streaming ingestion, and both BI and ML workloads." Which answer best covers the key architecture considerations?
Option C is strongest: it specifies the table format with a reason (Iceberg for multi-engine access), defines the medallion layers, separates query engines by workload (BI vs. ML), covers governance at table and column level, adds data quality gates between layers, and includes cost optimisation. Option D's Lambda Architecture is an older pattern that the lakehouse largely supersedes: batch and streaming writers land in the same transactional tables, so there is no need to maintain separate batch and speed layers and reconcile them in a serving layer. Lakehouse design: table format selection → ingestion per type → medallion architecture → query engine per workload → governance → data quality gates → cost model.
2 / 5
The interviewer asks: "Design a CDC pipeline from a production PostgreSQL database to a data warehouse, minimising load on the source database." Which answer best addresses design and operational requirements?
Option A is strongest: it explains why WAL-based replication is low-overhead (comparing overhead percentages), covers schema evolution with Schema Registry, addresses DELETE handling (a common warehouse CDC gotcha), details the initial snapshot challenge, names the replication slot lag risk (a slot that falls behind retains WAL until the disk fills, a real production failure mode), and achieves exactly-once results via idempotent upserts, so that replayed events overwrite rather than duplicate rows. Option D's timestamp polling is precisely what the question rules out: it puts repeated query load on the source (index scans on updated_at) and misses hard deletes entirely. CDC design: WAL-based approach → Debezium → schema evolution → DELETE handling → initial snapshot → replication slot management → exactly-once results with idempotent upserts.
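To make the idempotent-upsert and DELETE-handling points concrete, here is a minimal sketch of a sink that tolerates redelivered change events. It is an illustration under stated assumptions, not the pipeline from any answer option: `ChangeEvent`, `apply_event`, and the in-memory `table` dict are hypothetical names, the `op` codes mirror Debezium's `c`/`u`/`d` convention, and the `lsn` field is assumed to increase monotonically per key.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    op: str                # "c" (create), "u" (update) or "d" (delete)
    key: int               # primary key of the source row
    lsn: int               # log position; assumed to increase monotonically
    after: Optional[dict]  # row image after the change; None for deletes


def apply_event(table: dict, event: ChangeEvent) -> bool:
    """Apply one event to the target; return False if it was a stale replay."""
    current = table.get(event.key)
    if current is not None and current["lsn"] >= event.lsn:
        return False  # already applied, so redelivery is a no-op
    if event.op == "d":
        # Keep a tombstone instead of dropping the key, so a replayed older
        # update cannot resurrect a deleted row.
        table[event.key] = {"lsn": event.lsn, "deleted": True, "row": None}
    else:
        table[event.key] = {"lsn": event.lsn, "deleted": False, "row": event.after}
    return True


# Hypothetical event stream: two writes to key 1, then a create and delete of key 2.
events = [
    ChangeEvent("c", 1, 100, {"amount": 10}),
    ChangeEvent("u", 1, 101, {"amount": 20}),
    ChangeEvent("c", 2, 102, {"amount": 5}),
    ChangeEvent("d", 2, 103, None),
]

table = {}
for event in events + events:  # deliver every event twice to simulate replay
    apply_event(table, event)
```

In a real warehouse the same comparison would typically live in a MERGE statement keyed on the primary key and log position; the dict here only stands in for that target table.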
3 / 5
The interviewer asks: "Explain data mesh and when you would and wouldn't recommend it." Choose the most balanced and practical answer.
Option D is strongest: it explains all four pillars with the business problem each solves, specifies concrete "when to use" criteria (team size, autonomy, bottleneck symptoms), and gives three specific "when NOT to use" scenarios with reasons, plus the common failure mode (org change without platform investment). Options B and C describe pillars and vague scale criteria but don't show judgment about when the model is inappropriate. Data mesh answer: four pillars with problems they solve → when to use (size, autonomy, bottleneck) → when NOT to use (small org, regulated industries, low maturity) → common failure mode.
4 / 5
The interviewer asks: "How would you implement data quality monitoring for a pipeline that feeds both BI dashboards and ML model training?" Which answer demonstrates a complete data quality engineering approach?
Option B is strongest: it defines four quality dimensions with different check types, specifies check placement in the medallion pipeline, names tiered tooling (dbt + Great Expectations + Monte Carlo), defines severity tiers with concrete thresholds, includes SLA freshness tracking, and adds both ML-specific label distribution monitoring and lineage-based upstream alert propagation. Option D describes data contracts correctly, but a contract is only the agreement; it does not explain how the monitoring itself is implemented. Data quality monitoring: four dimensions → check placement by layer → tiered tooling → severity threshold tiers → SLA freshness → ML label distribution → lineage-based upstream alerts.
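The severity-tier, freshness, and label-distribution ideas can be sketched in a few lines of plain Python. This is a rough illustration only: the function names and every threshold in the comments are hypothetical, and in practice these checks would usually be expressed in the tooling named above instead of hand-written.

```python
from datetime import datetime, timedelta, timezone


def null_rate(rows, column):
    """Completeness: fraction of rows where `column` is missing."""
    if not rows:
        return 1.0  # an empty batch is treated as fully incomplete
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def severity(value, warn_at, fail_at):
    """Map a metric to a tier; higher values are worse."""
    if value >= fail_at:
        return "fail"  # e.g. block promotion to the next layer
    if value >= warn_at:
        return "warn"  # e.g. alert but let the pipeline continue
    return "ok"


def freshness_minutes(latest_event_time, now):
    """Freshness: minutes since the newest record arrived."""
    return (now - latest_event_time).total_seconds() / 60


def label_shift(baseline, current):
    """Largest absolute change in class share between two label distributions."""
    labels = set(baseline) | set(current)
    return max(
        (abs(baseline.get(l, 0.0) - current.get(l, 0.0)) for l in labels),
        default=0.0,
    )


# Hypothetical batch and thresholds, chosen only to exercise each tier.
rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": None},
        {"id": 3, "amount": 5}, {"id": 4, "amount": 7}]
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

completeness_tier = severity(null_rate(rows, "amount"), warn_at=0.01, fail_at=0.05)
freshness_tier = severity(
    freshness_minutes(now - timedelta(minutes=45), now), warn_at=30, fail_at=60
)
shift = label_shift({"fraud": 0.02, "ok": 0.98}, {"fraud": 0.10, "ok": 0.90})
```

Keeping the metric functions separate from `severity` lets the BI and ML consumers share the same measurements while applying different thresholds to them.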
5 / 5
The interviewer asks: "Design a real-time analytics pipeline that needs to answer 'revenue in the last 5 minutes by region' with sub-second response time." Which answer best covers the design requirements?
Option A is strongest: it covers the full pipeline (ingest → stream processing → serving store → query interface), compares serving store options (Druid vs. ClickHouse vs. Redis), specifies the freshness/latency trade-off with concrete numbers (1–5s freshness, sub-100ms query), includes fault tolerance (Flink checkpoint + Kafka offset replay), and adds backfill isolation. Option D's Lambda Architecture is overly complex — Druid and ClickHouse have made the separate batch accuracy layer unnecessary for this use case. Real-time analytics: ingest SLA → Flink windowed aggregation → serving store choice → query interface → freshness/latency numbers → fault tolerance → backfill isolation.
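The windowed aggregation at the centre of this pipeline can be sketched without any streaming framework. The class below is a simplified stand-in for what a Flink job plus a serving store would do, under assumptions of my own: events are pre-aggregated into one-minute buckets per region, timestamps are integer epoch seconds, amounts are in minor currency units, and `RevenueWindow` is a hypothetical name.

```python
from collections import defaultdict

BUCKET_SECONDS = 60    # assumed pre-aggregation granularity
WINDOW_SECONDS = 300   # "revenue in the last 5 minutes"


class RevenueWindow:
    def __init__(self):
        # (region, bucket_start) -> revenue in minor units (e.g. cents)
        self.buckets = defaultdict(int)

    def add(self, region, event_ts, amount_cents):
        """Fold one event into its one-minute bucket (the stream-processing step)."""
        bucket = event_ts - event_ts % BUCKET_SECONDS
        self.buckets[(region, bucket)] += amount_cents

    def revenue_by_region(self, now_ts):
        """Sum the buckets covering the window (the serving-store query).

        The window start is rounded down to a bucket boundary, so the result
        covers between five and six minutes of data.
        """
        cutoff = now_ts - WINDOW_SECONDS
        first_bucket = cutoff - cutoff % BUCKET_SECONDS
        totals = defaultdict(int)
        for (region, bucket), amount in self.buckets.items():
            if first_bucket <= bucket <= now_ts:
                totals[region] += amount
        return dict(totals)

    def evict(self, now_ts):
        """Drop buckets that can no longer fall inside any future window."""
        oldest = now_ts - WINDOW_SECONDS - BUCKET_SECONDS
        for key in [k for k in self.buckets if k[1] < oldest]:
            del self.buckets[key]


# Hypothetical events as (region, epoch seconds, cents).
window = RevenueWindow()
for region, ts, cents in [("EU", 650, 100), ("EU", 670, 200),
                          ("EU", 990, 300), ("US", 800, 50)]:
    window.add(region, ts, cents)

totals = window.revenue_by_region(now_ts=1000)
```

Pre-aggregating into small buckets is what keeps the query sub-second: the read touches at most a handful of rows per region regardless of event volume, at the cost of the bucket-sized imprecision noted in the docstring.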