5 exercises — the vocabulary every data engineer needs in English: medallion architecture, idempotency, Kafka streaming, data mesh, and slowly changing dimensions.
Core data engineering vocabulary clusters
Architecture patterns: medallion (Bronze/Silver/Gold), data lakehouse, data mesh, Lambda/Kappa architecture
Storage: Delta Lake, Iceberg, Hudi, Parquet, partitioning, compaction, ACID
Warehouse modelling: fact table, dimension, SCD Type 2, surrogate key, star schema, snowflake schema
Tooling: Spark, dbt, Airflow, Dagster, Fivetran, Great Expectations, data catalog
1 / 5
A data engineer explains their pipeline: "We use a medallion architecture — raw data lands in the Bronze layer, gets cleaned and deduplicated in Silver, and the Gold layer contains business-level aggregates ready for dashboards." What is the medallion architecture?
Medallion architecture (popularised by Databricks) is a data organisation pattern for data lakehouses with three progressively refined layers:
Bronze (raw): unmodified source data, kept as-is for reprocessing.
Silver (cleaned/conformed): deduplicated, type-cast, validated, joined across sources.
Gold (aggregated): business-level metrics, KPIs, wide tables ready for BI tools.
Benefits: auditability (raw data always available), progressive refinement, separate compute for each layer, easier debugging.
Related data engineering vocabulary:
Data lakehouse: combines data lake storage economics with data warehouse SQL/ACID capabilities.
Delta Lake: an open-source storage layer adding ACID transactions and time travel to data lakes.
ELT (Extract, Load, Transform): load raw data first, transform in place (vs ETL: transform before loading).
Idempotency: running a pipeline multiple times produces the same result.
Data freshness: how recent the data in each layer is.
In conversation: "All our raw Kafka events land in Bronze — we never delete from there, so we can always replay from source of truth."
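A minimal PySpark/Delta Lake sketch of the three layers, assuming a Spark session with Delta configured; the paths and column names (order_id, order_ts, amount, customer_id) are invented for the example:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: land raw source data unmodified, so it can always be replayed.
raw = spark.read.json("/landing/orders/2024-06-01/")          # hypothetical source path
raw.write.format("delta").mode("append").save("/lake/bronze/orders")

# Silver: deduplicate, cast types, and validate.
bronze = spark.read.format("delta").load("/lake/bronze/orders")
silver = (bronze
    .dropDuplicates(["order_id"])
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .filter(F.col("order_id").isNotNull()))
silver.write.format("delta").mode("overwrite").save("/lake/silver/orders")

# Gold: business-level aggregates ready for dashboards.
gold = (silver
    .groupBy(F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("daily_revenue"),
         F.countDistinct("customer_id").alias("unique_customers")))
gold.write.format("delta").mode("overwrite").save("/lake/gold/daily_revenue")
```

Each layer writes to its own location, so the Silver and Gold jobs can always be re-run from the untouched Bronze data.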
2 / 5
A senior data engineer reviews a pipeline design: "This job isn't idempotent — if it runs twice you'll get duplicate records in the output table. You need to use upsert logic or a deduplication step." What does idempotent mean in data pipelines?
Idempotency is a critical property for reliable data pipelines: running the same pipeline multiple times should produce the same result as running it once. Without idempotency, a pipeline failure followed by a retry can introduce duplicate data.
How to achieve idempotency:
Upsert (MERGE): insert if not exists, update if exists, based on a unique key.
Overwrite partitions: each write replaces the partition entirely rather than appending.
Deduplication: filter out records with duplicate keys at write time.
Audit columns: record a load timestamp so you can identify and remove reloaded rows.
Checkpoint: track how far a streaming job has processed so it can resume from the right offset, not from the beginning.
In Spark: spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") makes an overwrite replace only the partitions present in the data being written.
Idempotency also matters in API design: a POST that creates a resource can be made idempotent if the client sends a unique request id. In data: "Our warehouse loads are fully idempotent — we can re-run any day's load without worry."
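A hedged sketch of the upsert (MERGE) approach using the Delta Lake Python API (delta-spark), assuming an existing SparkSession named spark; the table path, staging path, and order_id key are placeholders:

```python
from delta.tables import DeltaTable

# Staged batch that may contain rows we have already loaded once before.
updates = spark.read.parquet("/staging/orders_batch")        # hypothetical staging path

target = DeltaTable.forPath(spark, "/lake/silver/orders")    # hypothetical Delta table

# MERGE on the business key: re-running the same batch updates existing rows
# instead of appending duplicates, so the load is idempotent.
(target.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```

Because the merge is keyed on order_id, retrying the same batch after a failure rewrites the rows it already wrote rather than adding new ones.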
3 / 5
A team discusses streaming architecture: "We use Apache Kafka as the backbone. Every service publishes events to topics, and consumers read at their own pace. The events are retained for 7 days so we can replay if a downstream system has a bug." What is event retention and why does it matter?
Event retention in Kafka defines how long messages are stored on the broker before deletion — by time (e.g., 7 days) or by size. This enables replay: if a downstream consumer has a bug and processes messages incorrectly, it can reset its offset and re-read from an earlier point. Key Kafka vocabulary: Topic — a named, ordered log of messages. Partition — a topic is split into partitions for parallel processing. Offset — the position of a message within a partition (monotonically increasing). Producer — a process that writes to a topic. Consumer group — a set of consumers that share the reading of a topic. Lag — how far behind a consumer group is from the latest message. Compacted topic — retains only the latest message per key (for stateful data like user profiles). At-least-once delivery — messages are guaranteed to arrive at least once (consumer must handle deduplication). Exactly-once semantics — Kafka Streams and transactional producers can guarantee exactly-once. In conversation: "We keep 30 days of retention on our payment events topic so we can audit or replay any transaction."
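As an illustration of replay within the retention window, a sketch using the kafka-python client that rewinds a consumer to the offsets recorded around a chosen timestamp; the broker address, topic, group id, and process() function are placeholders:

```python
from datetime import datetime, timedelta, timezone
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",   # placeholder broker
    group_id="orders-replay",             # placeholder consumer group
    enable_auto_commit=False,
)

# Manually assign the partition we want to replay.
tp = TopicPartition("payment-events", 0)  # placeholder topic
consumer.assign([tp])

# Find the offset of the first message produced after "3 days ago"...
three_days_ago = datetime.now(timezone.utc) - timedelta(days=3)
ts_ms = int(three_days_ago.timestamp() * 1000)
offsets = consumer.offsets_for_times({tp: ts_ms})

# ...and rewind to it, so everything since that point is re-read.
if offsets[tp] is not None:
    consumer.seek(tp, offsets[tp].offset)

for message in consumer:
    process(message.value)                # placeholder processing function
```

This only works while the messages are still within the retention period; once the broker deletes them, there is nothing left to replay.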
4 / 5
A data team lead presents: "We've adopted a data mesh approach — each domain team owns and publishes their data as a product. Payments owns the transactions dataset, Inventory owns the stock dataset. Central data platform provides the infrastructure." What is data mesh?
Data mesh (Zhamak Dehghani, 2019) is a decentralised data architecture with four principles:
Domain ownership: each business domain owns and governs its own data.
Data as a product: domain teams treat their datasets as products that are discoverable, reliable, well-documented, and SLA-backed.
Self-serve data platform: a central team provides infrastructure (storage, pipelines, tooling) that domain teams use without needing central data engineers.
Federated computational governance: global standards (interoperability, security, privacy) enforced automatically.
Data mesh vs data warehouse:
Centralised warehouse: a central team ingests and transforms data from all domains; a bottleneck at scale.
Data mesh: domain teams produce data products; the platform provides standardised infrastructure.
In practice, each domain publishes standardised datasets (data products) with metadata, quality SLAs, and access controls.
Other data vocabulary:
Data catalog: a searchable inventory of all datasets with metadata (DataHub, Collibra, dbt docs).
Data lineage: tracking the origin and transformations of data.
dbt (data build tool): a transformation framework for writing SQL models with testing and documentation.
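Purely as an illustration of "data as a product", here is a hypothetical Python sketch of the kind of metadata contract a domain team might publish alongside its dataset; every field and value below is invented for the example and not taken from any specific tool:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Hypothetical descriptor a domain team might register in the data catalog."""
    name: str                      # e.g. "payments.transactions"
    owner_team: str                # domain team accountable for the product
    description: str
    schema: dict                   # column name -> type
    freshness_sla_hours: int       # how stale the data is allowed to become
    quality_checks: list = field(default_factory=list)

transactions = DataProduct(
    name="payments.transactions",
    owner_team="payments",
    description="One row per settled card transaction.",
    schema={"transaction_id": "string", "amount": "decimal(18,2)", "settled_at": "timestamp"},
    freshness_sla_hours=4,
    quality_checks=["transaction_id is unique", "amount >= 0"],
)
```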
5 / 5
A data engineer describes a common problem: "We had a slowly changing dimension issue — a customer changed their city, so their historical orders started showing the new city instead of the one they lived in at the time. We needed to implement SCD Type 2." What is a slowly changing dimension (SCD)?
A Slowly Changing Dimension (SCD) is a data warehouse concept for handling dimension attributes that change over time — customer addresses, employee departments, product prices. The challenge: when a value changes, do you overwrite it (losing history) or preserve it (and if so, how)?
SCD strategies:
Type 1 (Overwrite): update the record in place; no history kept. Simplest, but historical analysis will show incorrect data.
Type 2 (Add new row): insert a new row for the changed version, with start/end dates and an is_current flag. Preserves full history. Most common for analytical workloads.
Type 3 (Add column): add a "previous value" column alongside the current value. Limits history to one version back.
Type 4 (History table): keep current values in the main table, and all historical versions in a separate history table.
Key SCD vocabulary:
Surrogate key: a system-generated unique key for each dimension row (as opposed to the natural/business key).
Effective/expiry dates: the date range during which an SCD Type 2 row was valid.
is_current flag: a boolean marking the active version.
In conversation: "We implemented SCD Type 2 on the customer dimension so our cohort analysis accurately reflects where customers lived when they made each purchase."
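A hedged PySpark/Delta Lake sketch of an SCD Type 2 load, assuming an existing SparkSession named spark, a hypothetical dim_customer Delta table, and a changes DataFrame that contains only customers whose attributes actually changed; surrogate key generation is omitted for brevity:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

dim = DeltaTable.forPath(spark, "/warehouse/dim_customer")   # hypothetical dimension table
changes = spark.read.parquet("/staging/customer_changes")    # hypothetical changed rows only

# Step 1: close out the currently active row for every changed customer.
(dim.alias("d")
    .merge(changes.alias("c"), "d.customer_id = c.customer_id AND d.is_current = true")
    .whenMatchedUpdate(set={
        "is_current": "false",
        "expiry_date": "current_date()",
    })
    .execute())

# Step 2: append a fresh "current" row carrying the new attribute values.
new_rows = (changes
    .withColumn("effective_date", F.current_date())
    .withColumn("expiry_date", F.lit(None).cast("date"))
    .withColumn("is_current", F.lit(True)))

new_rows.write.format("delta").mode("append").save("/warehouse/dim_customer")
```

In production pipelines the expire-and-insert steps are usually combined into a single MERGE (a union of staged updates) so a failure between them cannot leave the dimension half-updated.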