Master Iceberg's vocabulary — snapshot model, manifest hierarchy, time travel queries, and write strategies.
1 / 5
At standup, a junior data engineer asks what an Iceberg snapshot is. What is correct?
An Iceberg snapshot is a lightweight, immutable record of the table's state at a specific moment. It contains a snapshot-id, a reference to a manifest list, and metadata such as the operation type and summary statistics. Every committed write adds a new snapshot instead of modifying an existing one, and data files that did not change are shared between snapshots rather than copied, which keeps time travel and rollback efficient.
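To make the snapshot model concrete, here is a minimal pure-Python sketch. It is a toy model, not Iceberg's actual classes: `Snapshot`, `TableMetadata`, `commit`, and `rollback_to` are hypothetical names, and the sequential snapshot ids and file names are simplifications chosen for readability.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: a snapshot cannot be modified after creation
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]
    timestamp_ms: int
    operation: str       # e.g. "append", "overwrite", "delete"
    manifest_list: str   # location of this snapshot's manifest list


class TableMetadata:
    """Toy stand-in for the snapshot log kept in table metadata (hypothetical)."""

    def __init__(self):
        self.snapshots = []
        self.current_id = None

    def commit(self, operation, manifest_list, timestamp_ms):
        # A write adds a new snapshot; earlier snapshots are left untouched.
        snap = Snapshot(len(self.snapshots) + 1, self.current_id,
                        timestamp_ms, operation, manifest_list)
        self.snapshots.append(snap)
        self.current_id = snap.snapshot_id
        return snap

    def rollback_to(self, snapshot_id):
        # Rollback only moves the "current" pointer; no data is rewritten.
        if all(s.snapshot_id != snapshot_id for s in self.snapshots):
            raise ValueError(f"unknown snapshot {snapshot_id}")
        self.current_id = snapshot_id


table = TableMetadata()
first = table.commit("append", "snap-1.avro", 1_000)
second = table.commit("overwrite", "snap-2.avro", 2_000)
table.rollback_to(first.snapshot_id)
```

After the rollback both snapshots are still in the log; only the current pointer has changed, which is why rollback is a metadata-only operation in this model.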
2 / 5
In a PR review, a teammate asks to clarify the difference between a manifest list and a manifest file. What is correct?
Iceberg has a two-level metadata indirection: a manifest list (one per snapshot) lists all manifest files for that snapshot. Each manifest file lists a subset of the table's data files (Parquet/ORC/Avro) along with their partition values and column-level min/max statistics. Query engines use these statistics to prune partitions and skip files that cannot match a predicate, without opening the data files themselves.
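The two-level hierarchy and statistics-based file skipping can be sketched in plain Python. This is an illustration only: the classes, the single `id` column, the string partition value, and the file names are all assumptions, and real manifests are Avro files with richer statistics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    partition: str   # partition value, e.g. an event date
    min_id: int      # lower bound of the "id" column in this file
    max_id: int      # upper bound of the "id" column in this file


@dataclass(frozen=True)
class Manifest:
    path: str
    files: tuple     # the subset of data files tracked by this manifest


@dataclass(frozen=True)
class ManifestList:
    snapshot_id: int
    manifests: tuple  # every manifest that belongs to the snapshot


def plan_files(manifest_list, partition, id_value):
    """Return paths of data files that may hold rows for the partition and id."""
    selected = []
    for manifest in manifest_list.manifests:
        for f in manifest.files:
            if f.partition != partition:
                continue  # partition pruning
            if not (f.min_id <= id_value <= f.max_id):
                continue  # min/max statistics rule this file out
            selected.append(f.path)
    return selected


m1 = Manifest("m1.avro", (
    DataFile("a.parquet", "2024-01-01", 1, 100),
    DataFile("b.parquet", "2024-01-01", 101, 200),
))
m2 = Manifest("m2.avro", (DataFile("c.parquet", "2024-01-02", 1, 150),))
snapshot_manifests = ManifestList(1, (m1, m2))

candidates = plan_files(snapshot_manifests, "2024-01-01", 150)
```

Only the metadata is consulted during planning; the three hypothetical Parquet files are never opened to decide which of them need to be scanned.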
3 / 5
An incident requires querying data as it was yesterday. How do you perform a time travel query in Spark with Iceberg?
Iceberg supports time travel natively in Spark SQL (Spark 3.3 and later) via two clauses: TIMESTAMP AS OF 'timestamp' for timestamp-based travel, and VERSION AS OF snapshot_id for snapshot-based travel. No restore or data copy is needed: Iceberg resolves the historical snapshot from its metadata and reads the data files that snapshot references.
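A small Python sketch of both pieces follows. `time_travel_sql` builds query strings using the clause order documented for Spark 3.3 and later (TIMESTAMP AS OF and VERSION AS OF), and `snapshot_as_of` models timestamp resolution as "latest snapshot committed at or before the requested time". Both helpers, the table name `prod.db.events`, and the snapshot ids are hypothetical; executing the generated SQL would require a Spark session with an Iceberg catalog configured, which is not shown.

```python
from bisect import bisect_right


def snapshot_as_of(history, timestamp_ms):
    """Pick the snapshot id current at timestamp_ms.

    history is a list of (commit_timestamp_ms, snapshot_id) sorted by timestamp.
    """
    times = [ts for ts, _ in history]
    idx = bisect_right(times, timestamp_ms)
    if idx == 0:
        raise ValueError("no snapshot at or before the requested timestamp")
    return history[idx - 1][1]


def time_travel_sql(table, *, timestamp=None, snapshot_id=None):
    """Build a Spark SQL time travel query for exactly one of the two modes."""
    if (timestamp is None) == (snapshot_id is None):
        raise ValueError("pass exactly one of timestamp or snapshot_id")
    if timestamp is not None:
        return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"
    return f"SELECT * FROM {table} VERSION AS OF {snapshot_id}"


history = [(1_000, 11), (2_000, 22), (3_000, 33)]
resolved = snapshot_as_of(history, 2_500)

by_time = time_travel_sql("prod.db.events", timestamp="2024-01-01 00:00:00")
by_version = time_travel_sql("prod.db.events", snapshot_id=resolved)
```

For the incident scenario, the timestamp form is usually the more convenient one, since it does not require looking up a snapshot id first.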
4 / 5
During a design review on write performance, the team debates copy-on-write vs merge-on-read. What is correct?
Copy-on-write (CoW): updates and deletes rewrite the affected data files at write time, so reads are fast (no merging needed) but writes are expensive. Merge-on-read (MoR): writes produce small delete files (positional or equality deletes) that are cheap to write, but reads must merge the base data files with the delete files at query time. The trade-off is write amplification versus read amplification.
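The difference can be sketched with a data file modeled as a Python list of rows. This is a simplified illustration, not Iceberg code: the three helper functions are hypothetical, and the merge-on-read side models only positional deletes.

```python
def cow_delete(data_file, predicate):
    """Copy-on-write: produce a rewritten file without the deleted rows."""
    return [row for row in data_file if not predicate(row)]


def mor_delete(data_file, predicate):
    """Merge-on-read: keep the data file as-is and record deleted positions."""
    return {pos for pos, row in enumerate(data_file) if predicate(row)}


def mor_read(data_file, delete_positions):
    """Readers merge the base file with the positional delete file."""
    return [row for pos, row in enumerate(data_file)
            if pos not in delete_positions]


def is_even(row):
    return row["id"] % 2 == 0


base = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]

rewritten = cow_delete(base, is_even)   # write cost: every surviving row is rewritten
deletes = mor_delete(base, is_even)     # write cost: only the deleted positions
merged = mor_read(base, deletes)        # read cost: filter on every scan
```

Both paths return the same rows to the reader. What differs is where the work happens: at write time for copy-on-write, at read time for merge-on-read.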
5 / 5
In a code review, a teammate removes the Iceberg catalog configuration. What role does a catalog play?
The catalog is essential for table discovery and current-snapshot tracking. It maintains the authoritative mapping: table name → current metadata.json location. When a write commits, the catalog atomically updates this pointer to the new metadata file. Without a catalog, multiple engines cannot safely discover or concurrently update Iceberg tables.
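The atomic pointer update can be sketched as a compare-and-swap in Python. This is a toy in-memory model under stated assumptions: `InMemoryCatalog`, `CommitConflict`, the table name, and the metadata file names are hypothetical, and real catalogs rely on whatever atomic primitive their backing store offers rather than a process-local lock.

```python
import threading


class CommitConflict(Exception):
    """Raised when the table pointer moved since the writer last read it."""


class InMemoryCatalog:
    def __init__(self):
        self._lock = threading.Lock()
        self._pointers = {}  # table name -> current metadata.json location

    def current(self, table):
        return self._pointers.get(table)

    def commit(self, table, expected, new):
        # Swap the pointer only if it still equals what the writer started from.
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(f"{table} changed since {expected}")
            self._pointers[table] = new


catalog = InMemoryCatalog()
catalog.commit("db.events", None, "v1.metadata.json")

base = catalog.current("db.events")                      # both writers start here
catalog.commit("db.events", base, "v2.metadata.json")    # writer A wins

try:
    catalog.commit("db.events", base, "v3.metadata.json")  # writer B is stale
    conflict = False
except CommitConflict:
    conflict = True
```

In this model the losing writer would re-read the current pointer, reapply its changes on top of it, and retry the commit, which is the optimistic concurrency pattern the catalog makes possible.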