5 exercises — the vocabulary every data engineer needs in English: medallion architecture, idempotency, Kafka streaming, data mesh, and slowly changing dimensions.
Core data engineering vocabulary clusters
Architecture patterns: medallion (Bronze/Silver/Gold), data lakehouse, data mesh, Lambda/Kappa architecture
Storage: Delta Lake, Iceberg, Hudi, Parquet, partitioning, compaction, ACID
Warehouse modelling: fact table, dimension, SCD Type 2, surrogate key, star schema, snowflake schema
Tooling: Spark, dbt, Airflow, Dagster, Fivetran, Great Expectations, data catalog
1 / 5
A data engineer explains their pipeline: "We use a medallion architecture — raw data lands in the Bronze layer, gets cleaned and deduplicated in Silver, and the Gold layer contains business-level aggregates ready for dashboards." What is the medallion architecture?
Medallion architecture (popularised by Databricks) is a data organisation pattern for data lakehouses with three progressively refined layers:
Bronze (raw): unmodified source data, kept as-is for reprocessing.
Silver (cleaned/conformed): deduplicated, type-cast, validated, joined across sources.
Gold (aggregated): business-level metrics, KPIs, wide tables ready for BI tools.
Benefits: auditability (raw data always available), progressive refinement, separate compute for each layer, easier debugging.
Related data engineering vocabulary:
Data lakehouse: combines data lake storage economics with data warehouse SQL/ACID capabilities.
Delta Lake: an open-source storage layer adding ACID transactions and time travel to data lakes.
ELT (Extract, Load, Transform): load raw data first, transform in place (vs ETL: transform before loading).
Idempotency: running a pipeline multiple times produces the same result.
Data freshness: how recent the data in each layer is.
In conversation: "All our raw Kafka events land in Bronze — we never delete from there, so we can always replay from source of truth."
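A minimal PySpark/Delta Lake sketch of the three layers, assuming a Spark session with Delta configured; the paths and column names (order_id, order_ts, amount, customer_id) are invented for the example:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: land raw source data unmodified, so it can always be replayed.
raw = spark.read.json("/landing/orders/2024-06-01/")          # hypothetical source path
raw.write.format("delta").mode("append").save("/lake/bronze/orders")

# Silver: deduplicate, cast types, and validate.
bronze = spark.read.format("delta").load("/lake/bronze/orders")
silver = (bronze
    .dropDuplicates(["order_id"])
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .filter(F.col("order_id").isNotNull()))
silver.write.format("delta").mode("overwrite").save("/lake/silver/orders")

# Gold: business-level aggregates ready for dashboards.
gold = (silver
    .groupBy(F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("daily_revenue"),
         F.countDistinct("customer_id").alias("unique_customers")))
gold.write.format("delta").mode("overwrite").save("/lake/gold/daily_revenue")
```

Each layer writes to its own location, so the Silver and Gold jobs can always be re-run from the untouched Bronze data.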
2 / 5
A senior data engineer reviews a pipeline design: "This job isn't idempotent — if it runs twice you'll get duplicate records in the output table. You need to use upsert logic or a deduplication step." What does idempotent mean in data pipelines?
Idempotency is a critical property for reliable data pipelines: running the same pipeline multiple times should produce the same result as running it once. Without idempotency, a pipeline failure followed by a retry can introduce duplicate data.
How to achieve idempotency:
Upsert (MERGE): insert if not exists, update if exists, based on a unique key.
Overwrite partitions: each write replaces the partition entirely rather than appending.
Deduplication: filter out records with duplicate keys at write time.
Audit columns: record a load timestamp so you can identify and remove reloaded rows.
Checkpoint: track how far a streaming job has processed so it can resume from the right offset, not from the beginning.
In Spark: spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") makes an overwrite replace only the partitions present in the data being written.
Idempotency also matters in API design: a POST that creates a resource can be made idempotent if the client sends a unique request id. In data: "Our warehouse loads are fully idempotent — we can re-run any day's load without worry."
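A hedged sketch of the upsert (MERGE) approach using the Delta Lake Python API (delta-spark), assuming an existing SparkSession named spark; the table path, staging path, and order_id key are placeholders:

```python
from delta.tables import DeltaTable

# Staged batch that may contain rows we have already loaded once before.
updates = spark.read.parquet("/staging/orders_batch")        # hypothetical staging path

target = DeltaTable.forPath(spark, "/lake/silver/orders")    # hypothetical Delta table

# MERGE on the business key: re-running the same batch updates existing rows
# instead of appending duplicates, so the load is idempotent.
(target.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```

Because the merge is keyed on order_id, retrying the same batch after a failure rewrites the rows it already wrote rather than adding new ones.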
3 / 5
A team discusses streaming architecture: "We use Apache Kafka as the backbone. Every service publishes events to topics, and consumers read at their own pace. The events are retained for 7 days so we can replay if a downstream system has a bug." What is event retention and why does it matter?
Event retention in Kafka defines how long messages are stored on the broker before deletion — by time (e.g., 7 days) or by size. This enables replay: if a downstream consumer has a bug and processes messages incorrectly, it can reset its offset and re-read from an earlier point. Key Kafka vocabulary: Topic — a named, ordered log of messages. Partition — a topic is split into partitions for parallel processing. Offset — the position of a message within a partition (monotonically increasing). Producer — a process that writes to a topic. Consumer group — a set of consumers that share the reading of a topic. Lag — how far behind a consumer group is from the latest message. Compacted topic — retains only the latest message per key (for stateful data like user profiles). At-least-once delivery — messages are guaranteed to arrive at least once (consumer must handle deduplication). Exactly-once semantics — Kafka Streams and transactional producers can guarantee exactly-once. In conversation: "We keep 30 days of retention on our payment events topic so we can audit or replay any transaction."
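As an illustration of replay within the retention window, a sketch using the kafka-python client that rewinds a consumer to the offsets recorded around a chosen timestamp; the broker address, topic, group id, and process() function are placeholders:

```python
from datetime import datetime, timedelta, timezone
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",   # placeholder broker
    group_id="orders-replay",             # placeholder consumer group
    enable_auto_commit=False,
)

# Manually assign the partition we want to replay.
tp = TopicPartition("payment-events", 0)  # placeholder topic
consumer.assign([tp])

# Find the offset of the first message produced after "3 days ago"...
three_days_ago = datetime.now(timezone.utc) - timedelta(days=3)
ts_ms = int(three_days_ago.timestamp() * 1000)
offsets = consumer.offsets_for_times({tp: ts_ms})

# ...and rewind to it, so everything since that point is re-read.
if offsets[tp] is not None:
    consumer.seek(tp, offsets[tp].offset)

for message in consumer:
    process(message.value)                # placeholder processing function
```

This only works while the messages are still within the retention period; once the broker deletes them, there is nothing left to replay.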
4 / 5
A data team lead presents: "We've adopted a data mesh approach — each domain team owns and publishes their data as a product. Payments owns the transactions dataset, Inventory owns the stock dataset. Central data platform provides the infrastructure." What is data mesh?
Data mesh (Zhamak Dehghani, 2019) is a decentralised data architecture with four principles:
Domain ownership: each business domain owns and governs its own data.
Data as a product: domain teams treat their datasets as products that are discoverable, reliable, well-documented, and SLA-backed.
Self-serve data platform: a central team provides infrastructure (storage, pipelines, tooling) that domain teams use without needing central data engineers.
Federated computational governance: global standards (interoperability, security, privacy) enforced automatically.
Data mesh vs data warehouse:
Centralised warehouse: a central team ingests and transforms data from all domains; a bottleneck at scale.
Data mesh: domain teams produce data products; the platform provides standardised infrastructure.
In practice, each domain publishes standardised datasets (data products) with metadata, quality SLAs, and access controls.
Other data vocabulary:
Data catalog: a searchable inventory of all datasets with metadata (DataHub, Collibra, dbt docs).
Data lineage: tracking the origin and transformations of data.
dbt (data build tool): a transformation framework for writing SQL models with testing and documentation.
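Purely as an illustration of "data as a product", here is a hypothetical Python sketch of the kind of metadata contract a domain team might publish alongside its dataset; every field and value below is invented for the example and not taken from any specific tool:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Hypothetical descriptor a domain team might register in the data catalog."""
    name: str                      # e.g. "payments.transactions"
    owner_team: str                # domain team accountable for the product
    description: str
    schema: dict                   # column name -> type
    freshness_sla_hours: int       # how stale the data is allowed to become
    quality_checks: list = field(default_factory=list)

transactions = DataProduct(
    name="payments.transactions",
    owner_team="payments",
    description="One row per settled card transaction.",
    schema={"transaction_id": "string", "amount": "decimal(18,2)", "settled_at": "timestamp"},
    freshness_sla_hours=4,
    quality_checks=["transaction_id is unique", "amount >= 0"],
)
```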
5 / 5
A data engineer describes a common problem: "We had a slowly changing dimension issue — a customer changed their city, so their historical orders started showing the new city instead of the one they lived in at the time. We needed to implement SCD Type 2." What is a slowly changing dimension (SCD)?
A Slowly Changing Dimension (SCD) is a data warehouse concept for handling dimension attributes that change over time — customer addresses, employee departments, product prices. The challenge: when a value changes, do you overwrite it (losing history) or preserve it (and if so, how)?
SCD strategies:
Type 1 (Overwrite): update the record in place; no history kept. Simplest, but historical analysis will show incorrect data.
Type 2 (Add new row): insert a new row for the changed version, with start/end dates and an is_current flag. Preserves full history. Most common for analytical workloads.
Type 3 (Add column): add a "previous value" column alongside the current value. Limits history to one version back.
Type 4 (History table): keep current values in the main table, and all historical versions in a separate history table.
Key SCD vocabulary:
Surrogate key: a system-generated unique key for each dimension row (as opposed to the natural/business key).
Effective/expiry dates: the date range during which an SCD Type 2 row was valid.
is_current flag: a boolean marking the active version.
In conversation: "We implemented SCD Type 2 on the customer dimension so our cohort analysis accurately reflects where customers lived when they made each purchase."
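A hedged PySpark/Delta Lake sketch of an SCD Type 2 load, assuming an existing SparkSession named spark, a hypothetical dim_customer Delta table, and a changes DataFrame that contains only customers whose attributes actually changed; surrogate key generation is omitted for brevity:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

dim = DeltaTable.forPath(spark, "/warehouse/dim_customer")   # hypothetical dimension table
changes = spark.read.parquet("/staging/customer_changes")    # hypothetical changed rows only

# Step 1: close out the currently active row for every changed customer.
(dim.alias("d")
    .merge(changes.alias("c"), "d.customer_id = c.customer_id AND d.is_current = true")
    .whenMatchedUpdate(set={
        "is_current": "false",
        "expiry_date": "current_date()",
    })
    .execute())

# Step 2: append a fresh "current" row carrying the new attribute values.
new_rows = (changes
    .withColumn("effective_date", F.current_date())
    .withColumn("expiry_date", F.lit(None).cast("date"))
    .withColumn("is_current", F.lit(True)))

new_rows.write.format("delta").mode("append").save("/warehouse/dim_customer")
```

In production pipelines the expire-and-insert steps are usually combined into a single MERGE (a union of staged updates) so a failure between them cannot leave the dimension half-updated.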