Intermediate · 6 topic areas · 84+ exercises

Data Engineer

Data engineers build the plumbing that powers analytics and ML. This path covers the vocabulary for designing pipelines, discussing partitioning strategies, documenting data contracts, and communicating with data scientists and analysts who depend on reliable, well-described data.

Topics covered

  • ETL/ELT pipelines
  • Data warehouse design
  • Streaming & batch
  • Orchestration
  • Data quality
  • Data contracts

Vocabulary spotlight

4 terms every Data Engineer should know in English:

lineage n.

The record of where data comes from, how it transforms, and where it goes

"Our data lineage tool shows every transformation between raw source and the BI dashboard."

idempotent pipeline n.

A pipeline that can be re-run multiple times without producing duplicate or incorrect data

"Make every stage idempotent so we can safely replay any failed run."

partitioning n.

Dividing a large dataset into smaller, independent parts to improve query performance

"Partitioning by event_date reduced our query costs by 70%."

schema evolution n.

Managing changes to a data schema over time without breaking downstream consumers

"We use Avro for schema evolution — new fields default to null for old records."

📚 Vocabulary Reference

Key terms organised by category for Data Engineers:

Pipeline Fundamentals

ETL, ELT, pipeline, DAG, task, operator, backfill, idempotent, retry logic, SLA

Storage & Architecture

data lake, data warehouse, data lakehouse, OLAP, OLTP, columnar storage, partitioning, bucketing, compaction, Delta table

Streaming

event stream, topic, partition, offset, consumer group, producer, at-least-once, exactly-once, backpressure, watermark

Data Quality

data contract, schema evolution, lineage, freshness, completeness, accuracy, null rate, anomaly detection, data test, expectation

Recommended exercises

Real-world scenarios you'll practise

  • Explaining pipeline failure and data loss to stakeholders in a post-mortem
  • Designing a data contract interface with an analytics team
  • Presenting a data lakehouse migration plan to engineering leadership
  • Documenting SLA expectations for a data pipeline

🎯 Interview questions specific to this role

Practise answering these questions out loud — or in writing. Each question targets a real interviewer concern for Data Engineers.

  1. What is the difference between ETL and ELT, and when does each approach make sense?
  2. How do you handle schema changes without breaking downstream consumers?
  3. Walk me through how you would design a real-time data pipeline.
  4. What strategies do you use to ensure data quality at scale?
  5. How do you balance pipeline idempotency with performance?
