Practise vocabulary for data lakes, lakehouses, medallion architecture, Delta Lake, Apache Iceberg, and ACID transactions.
1 / 5
The medallion architecture (bronze/silver/gold layers) organises data as:
The medallion architecture improves data quality layer by layer. Bronze stores source-faithful raw data, kept for audit and replay. Silver applies cleansing, deduplication, type casting, and joins. Gold produces business-level aggregations optimised for specific use cases (finance reporting, product analytics).
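The three layers can be illustrated with a minimal pure-Python sketch. The sample rows and the function names `to_silver` and `to_gold` are hypothetical and chosen only for illustration; a real pipeline would typically run on Spark or a similar engine against files in object storage.

```python
from collections import defaultdict

# Bronze: source-faithful rows, kept exactly as received (all strings, duplicates included).
bronze = [
    {"order_id": "1", "region": "EU", "amount": "10.50"},
    {"order_id": "1", "region": "EU", "amount": "10.50"},         # duplicate delivery
    {"order_id": "2", "region": "US", "amount": "7.25"},
    {"order_id": "3", "region": "EU", "amount": "not-a-number"},  # malformed record
    {"order_id": "4", "region": "EU", "amount": "4.50"},
]

def to_silver(rows):
    """Cleanse, cast types, and deduplicate on order_id."""
    seen = set()
    silver_rows = []
    for row in rows:
        try:
            order_id = int(row["order_id"])
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows that fail type casting
        if order_id in seen:
            continue  # drop duplicates
        seen.add(order_id)
        silver_rows.append({"order_id": order_id, "region": row["region"], "amount": amount})
    return silver_rows

def to_gold(rows):
    """Aggregate to a business-level view: total revenue per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Bronze is never modified, so silver and gold can be rebuilt from it if the cleansing rules change.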
2 / 5
A data lakehouse combines a data lake and a data warehouse by:
The lakehouse pattern: data is stored as open-format files (typically Parquet) in object storage such as S3, ADLS, or GCS, with a metadata layer (the Delta Lake transaction log, or Iceberg metadata files tracked through a catalog) providing ACID guarantees, schema evolution, and time travel. Query engines (Spark, Trino, DuckDB) read directly from object storage.
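A rough sketch of the metadata-layer idea follows. It assumes an in-memory dict as a stand-in for an object store and JSON as a stand-in for Parquet. The key layout (`_log/`, `data/`) and the `add`/`remove` action format are simplified inventions for illustration, not the actual Delta Lake or Iceberg formats.

```python
import json

# Hypothetical stand-in for an object store bucket: key -> serialized content.
store = {}

def write_data_file(key, rows):
    store[key] = json.dumps(rows)  # JSON stands in for Parquet in this sketch

def commit(version, actions):
    # Each commit is one log entry listing the files it adds or removes.
    store[f"_log/{version:05d}.json"] = json.dumps(actions)

def live_files():
    """Replay the log in order to find which data files make up the table."""
    files = set()
    for key in sorted(k for k in store if k.startswith("_log/")):
        for action in json.loads(store[key]):
            if action["op"] == "add":
                files.add(action["path"])
            else:
                files.discard(action["path"])
    return files

def read_table():
    rows = []
    for path in sorted(live_files()):
        rows.extend(json.loads(store[path]))
    return rows

write_data_file("data/part-0.json", [{"id": 1}, {"id": 2}])
commit(0, [{"op": "add", "path": "data/part-0.json"}])

# A rewrite swaps files in a single log entry; the old file stays in storage.
write_data_file("data/part-1.json", [{"id": 1}, {"id": 2}, {"id": 3}])
commit(1, [{"op": "remove", "path": "data/part-0.json"},
           {"op": "add", "path": "data/part-1.json"}])

# A file that was written but never committed is invisible to readers.
write_data_file("data/part-orphan.json", [{"id": 99}])

table_rows = read_table()
print(table_rows)
```

Readers trust the log rather than a directory listing, which is what lets a query engine see a consistent table while files are being added and replaced underneath it.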
3 / 5
ACID transactions in a lakehouse context (Delta Lake, Iceberg) provide:
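Object storage such as S3 offers no multi-object transactions: a write that spans several files can be seen half-finished by readers, and concurrent writers can overwrite or conflict with each other's changes. Delta Lake and Iceberg add a transaction log (the Delta log, or Iceberg snapshot metadata) that commits each write atomically and uses optimistic concurrency control to detect conflicts, enabling safe concurrent operations that plain Parquet-on-S3 cannot support.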
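One way to picture the commit step is as a "create only if absent" write of the next numbered log entry. The sketch below emulates that on a local temporary directory with `os.O_EXCL`; the helper names `try_commit` and `commit_with_retry` are hypothetical, and real table formats also check whether the competing commit logically conflicts before retrying.

```python
import os
import tempfile

def try_commit(log_dir, version, payload):
    """Return True if this writer claimed `version`, False if it already exists."""
    path = os.path.join(log_dir, f"{version:05d}.json")
    try:
        # O_EXCL makes creation fail when the file exists, so only one writer wins.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return True

def commit_with_retry(log_dir, payload, start_version):
    """Optimistic concurrency: attempt a version, move to the next one on conflict."""
    version = start_version
    while not try_commit(log_dir, version, payload):
        version += 1  # a real system would re-validate its changes here
    return version

log_dir = tempfile.mkdtemp()

# Both writers read the table when it was empty, so both aim for version 0.
version_a = commit_with_retry(log_dir, '{"writer": "A"}', 0)
version_b = commit_with_retry(log_dir, '{"writer": "B"}', 0)

committed = sorted(os.listdir(log_dir))
print(version_a, version_b, committed)
```

Neither writer overwrites the other: the loser of the race lands on the next version, so the log records a single, ordered history.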
4 / 5
Time travel in Delta Lake or Apache Iceberg allows:
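Time travel example (Spark SQL; exact syntax varies by engine): SELECT * FROM orders VERSION AS OF 5 or SELECT * FROM orders TIMESTAMP AS OF '2024-01-15'. Delta Lake retains transaction log history; Iceberg retains metadata snapshots. Both can reproduce the exact data state at a past version or timestamp, provided that version's metadata and data files are still within the retention window and have not been vacuumed or expired.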
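The lookup behind both query forms can be sketched in a few lines. The `VersionedTable` class below is a hypothetical toy that keeps every snapshot in memory; it is not how Delta Lake or Iceberg store data, but it shows how a version number or timestamp resolves to one immutable snapshot.

```python
from bisect import bisect_right
from datetime import datetime

class VersionedTable:
    """Toy table that keeps one immutable snapshot per commit."""

    def __init__(self):
        self._snapshots = []  # list of (commit_time, rows); list index is the version

    def commit(self, rows, commit_time):
        self._snapshots.append((commit_time, tuple(rows)))
        return len(self._snapshots) - 1

    def version_as_of(self, version):
        return list(self._snapshots[version][1])

    def timestamp_as_of(self, ts):
        # Pick the latest snapshot committed at or before the requested time.
        times = [t for t, _ in self._snapshots]
        idx = bisect_right(times, ts) - 1
        if idx < 0:
            raise LookupError("no snapshot at or before that time")
        return list(self._snapshots[idx][1])

orders = VersionedTable()
v0 = orders.commit(["order-1"], datetime(2024, 1, 10))
v1 = orders.commit(["order-1", "order-2"], datetime(2024, 1, 20))

by_version = orders.version_as_of(0)
by_time = orders.timestamp_as_of(datetime(2024, 1, 15))
latest = orders.version_as_of(v1)
print(by_version, by_time, latest)
```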
5 / 5
Schema-on-read (data lake) compared to schema-on-write (data warehouse) means:
Data lake (schema-on-read): ingest anything and apply a schema only when querying, which gives flexible ingestion but pushes parsing and validation into every query. Data warehouse (schema-on-write): enforce the schema on insert, which gives guaranteed structure and simpler queries but rejects non-conforming data. Lakehouses bridge the two: store like a lake, query like a warehouse.
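The contrast can be made concrete with a small sketch. The sample events, the `REQUIRED` schema, and the helper names are all hypothetical; the point is only where validation happens, not how any particular warehouse or lake implements it.

```python
import json

raw_events = [
    '{"user": "ana", "age": "31"}',  # age arrives as a string
    '{"user": "bo"}',                # age missing
    '{"user": "cy", "age": 44}',
]

# Schema-on-write: validate at insert time and reject anything non-conforming.
REQUIRED = {"user": str, "age": int}

def insert_validated(table, line):
    record = json.loads(line)
    for field, field_type in REQUIRED.items():
        if not isinstance(record.get(field), field_type):
            raise ValueError(f"bad or missing field: {field}")
    table.append(record)

warehouse, rejected = [], []
for line in raw_events:
    try:
        insert_validated(warehouse, line)
    except ValueError:
        rejected.append(line)

# Schema-on-read: store every line untouched and interpret it at query time.
lake = list(raw_events)

def ages_on_read(lines):
    ages = []
    for line in lines:
        value = json.loads(line).get("age")
        try:
            ages.append(int(value))  # the query decides how to coerce
        except (TypeError, ValueError):
            continue                 # and how to treat missing or bad values
    return ages

print(len(warehouse), len(rejected), ages_on_read(lake))
```

The warehouse path keeps one clean row and turns two away at the door, while the lake path keeps all three and leaves the coercion rules to each query.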