Data Engineer
Data engineers build the plumbing that powers analytics and ML. This path covers the vocabulary for designing pipelines, discussing partitioning strategies, documenting data contracts, and communicating with data scientists and analysts who depend on reliable, well-described data.
Topics covered
- ETL/ELT pipelines
- Data warehouse design
- Streaming & batch
- Orchestration
- Data quality
- Data contracts
Vocabulary spotlight
4 terms every Data Engineer should know in English:
Data lineage
The record of where data comes from, how it is transformed, and where it goes
"Our data lineage tool shows every transformation between raw source and the BI dashboard."
Idempotent pipeline
A pipeline that can be re-run any number of times without producing duplicate or incorrect data
"Make every stage idempotent so we can safely replay any failed run."
Partitioning
Dividing a large dataset into smaller, independent parts so that queries scan only the data they need
"Partitioning by event_date reduced our query costs by 70%."
Schema evolution
Managing changes to a data schema over time without breaking downstream consumers
"We use Avro for schema evolution: new fields declare a null default, so old records can still be read."
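To make "idempotent" and "partitioning" concrete, here is a minimal Python sketch of how the two ideas often work together. It is an illustration under stated assumptions, not a production design: the in-memory dict stands in for a partitioned warehouse table, and the names `load_partitioned` and `query_by_date` are hypothetical.

```python
from collections import defaultdict


def load_partitioned(store, rows):
    """Load a batch into `store`, a dict keyed by event_date (one key = one partition).

    Each partition touched by the batch is overwritten rather than appended to,
    so replaying the same batch leaves the store unchanged (idempotent load).
    """
    batch = defaultdict(list)
    for row in rows:
        batch[row["event_date"]].append(row)
    for event_date, partition_rows in batch.items():
        store[event_date] = partition_rows  # replace the partition, never append
    return store


def query_by_date(store, event_date):
    """Read a single partition instead of scanning the whole dataset."""
    return store.get(event_date, [])


store = {}
rows = [
    {"event_date": "2024-01-01", "user": "a"},
    {"event_date": "2024-01-01", "user": "b"},
    {"event_date": "2024-01-02", "user": "c"},
]
load_partitioned(store, rows)
load_partitioned(store, rows)  # replaying the run creates no duplicates
print(len(query_by_date(store, "2024-01-01")))  # 2
```

In an interview you might describe this as "overwriting by partition": the load is safe to replay, and a query filtered on event_date touches only one partition.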
📚 Vocabulary Reference
Key terms organised by category for Data Engineers:
- Pipeline Fundamentals
- Storage & Architecture
- Streaming
- Data Quality
Recommended exercises
Real-world scenarios you'll practise
- Explaining pipeline failure and data loss to stakeholders in a post-mortem
- Designing a data contract interface with an analytics team
- Presenting a data lakehouse migration plan to engineering leadership
- Documenting SLA expectations for a data pipeline
🎯 Interview questions specific to this role
Practise answering these questions out loud or in writing. Each question targets a real interviewer concern for Data Engineers.
- What is the difference between ETL and ELT, and when does each approach make sense?
- How do you handle schema changes without breaking downstream consumers?
- Walk me through how you would design a real-time data pipeline.
- What strategies do you use to ensure data quality at scale?
- How do you balance pipeline idempotency with performance?