English Vocabulary for Apache Iceberg Table Format
Master the English vocabulary data engineers use with Apache Iceberg — snapshots, time travel, schema evolution, hidden partitioning, and catalog integration explained.
Apache Iceberg has become one of the most widely adopted open table formats for lakehouse architectures, replacing older approaches such as Hive-style tables for analytical workloads. Data engineers working with Spark, Flink, Trino, or DuckDB need to communicate precisely about Iceberg’s distinctive concepts, and the vocabulary in technical documentation, data team Slack channels, and conference talks follows consistent patterns worth learning.
Key Vocabulary
Table Format
A table format defines how data files, metadata, and schema information are organized on storage. Iceberg is an open table format — it is not a database engine but a specification for how engines read and write data. Engineers say Iceberg “defines,” “specifies,” or “implements” a table format.
Example: “We migrated from Hive-style tables to the Iceberg table format to get proper ACID transactions on our data lake.”
Snapshot
A snapshot represents the complete state of a table at a specific point in time. Each write operation creates a new snapshot. Engineers “take,” “create,” or “roll back to” snapshots. Snapshots are the foundation of Iceberg’s time travel and isolation features.
Example: “After the bad ETL run, we rolled back to the previous snapshot to restore the table to its last known good state.”
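To make “create” and “roll back to” concrete, here is a minimal Python sketch of a snapshot log. It is a toy model, not Iceberg’s implementation: the ToyTable and Snapshot classes and the file names are hypothetical, invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: Tuple[str, ...]  # the complete set of files visible in this snapshot


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table; it only tracks snapshots."""
    snapshots: List[Snapshot] = field(default_factory=list)
    current_snapshot_id: Optional[int] = None

    def commit(self, data_files: List[str]) -> int:
        """Every write adds a new snapshot and makes it the current one."""
        snapshot = Snapshot(len(self.snapshots) + 1, tuple(data_files))
        self.snapshots.append(snapshot)
        self.current_snapshot_id = snapshot.snapshot_id
        return snapshot.snapshot_id

    def rollback_to(self, snapshot_id: int) -> None:
        """Move the current pointer back; no snapshot is deleted."""
        if snapshot_id not in {s.snapshot_id for s in self.snapshots}:
            raise ValueError(f"unknown snapshot {snapshot_id}")
        self.current_snapshot_id = snapshot_id

    def current_files(self) -> Tuple[str, ...]:
        for snapshot in self.snapshots:
            if snapshot.snapshot_id == self.current_snapshot_id:
                return snapshot.data_files
        return ()


table = ToyTable()
good_id = table.commit(["part-0.parquet"])
bad_id = table.commit(["part-0.parquet", "bad-etl.parquet"])  # the bad ETL run
table.rollback_to(good_id)  # restore the last known good state
```

In this sketch the rollback only moves a pointer, which is why engineers describe it as cheap: the bad snapshot still exists until it is expired.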
Manifest
A manifest is a metadata file that lists a subset of the data files in a snapshot along with their statistics (record count, min/max values). Manifests are grouped into a manifest list. Engineers talk about Iceberg “writing,” “pruning,” and “reading” manifests during query planning.
Example: “Iceberg uses the min/max statistics in the manifest to skip irrelevant data files without scanning them.”
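The file skipping described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the ManifestEntry class, the plan_files function, and the sample statistics are hypothetical, and real manifests store per-column bounds in Avro files.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ManifestEntry:
    file_path: str
    record_count: int
    min_value: int  # lower bound of the filtered column in this file
    max_value: int  # upper bound of the filtered column in this file


def plan_files(manifest: List[ManifestEntry], lower: int, upper: int) -> List[str]:
    """Keep only files whose [min, max] range overlaps the filter [lower, upper]."""
    return [
        entry.file_path
        for entry in manifest
        if entry.max_value >= lower and entry.min_value <= upper
    ]


manifest = [
    ManifestEntry("f1.parquet", 1000, 1, 100),
    ManifestEntry("f2.parquet", 1000, 101, 200),
    ManifestEntry("f3.parquet", 1000, 201, 300),
]

# Filter: value BETWEEN 150 AND 220. f1 cannot match, so it is never opened.
selected = plan_files(manifest, 150, 220)
```

The planner decides from metadata alone, which is the sense of “without scanning them” in the example sentence.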
Time Travel
Time travel is the ability to query a table as it existed at a past snapshot or timestamp. Engineers “run a time travel query,” “use time travel,” or “query as of” a specific point.
Example: “I ran a time travel query against the sales table as of yesterday morning to compare it with today’s figures.”
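A query “as of” a timestamp has to be resolved to one concrete snapshot. The Python sketch below shows one plausible way to do that: pick the latest snapshot committed at or before the requested time. The history list, the snapshot IDs, and the snapshot_as_of function are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Tuple

# (commit time, snapshot ID), ordered by commit time; values are made up.
history: List[Tuple[datetime, int]] = [
    (datetime(2026, 5, 30, 8, 0), 101),
    (datetime(2026, 5, 31, 8, 0), 102),
    (datetime(2026, 6, 1, 8, 0), 103),
]


def snapshot_as_of(history: List[Tuple[datetime, int]], as_of: datetime) -> int:
    """Return the ID of the latest snapshot committed at or before as_of."""
    commit_times = [committed_at for committed_at, _ in history]
    index = bisect_right(commit_times, as_of)
    if index == 0:
        raise ValueError("no snapshot exists at or before the requested time")
    return history[index - 1][1]


# "As of" midday on 31 May resolves to the snapshot committed that morning.
resolved = snapshot_as_of(history, datetime(2026, 5, 31, 12, 0))
```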
Schema Evolution
Schema evolution is Iceberg’s ability to add, rename, drop, or reorder columns without rewriting existing data files. Engineers “perform,” “apply,” or “support” schema evolution. A key advantage over older formats is that Iceberg tracks columns by ID, not name.
Example: “We applied schema evolution to add a customer_segment column without touching the 10 TB of existing Parquet files.”
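Tracking columns by ID is what makes these changes metadata-only. The following Python sketch illustrates the idea with a toy reader; the field IDs, column names, and the read function are hypothetical and far simpler than a real Parquet reader.

```python
from typing import Dict, List, Optional

# Schema maps field ID -> current column name.
schema: Dict[int, str] = {1: "id", 2: "name"}

# Rows in an existing data file are keyed by field ID, not by column name.
old_file_rows: List[Dict[int, object]] = [{1: 10, 2: "Ada"}]


def read(rows: List[Dict[int, object]], schema: Dict[int, str]) -> List[Dict[str, Optional[object]]]:
    """Project stored rows through the current schema; missing IDs read as None."""
    return [{name: row.get(field_id) for field_id, name in schema.items()} for row in rows]


# Evolve the schema: rename field 2 and add field 3. The data file is untouched.
evolved = dict(schema)
evolved[2] = "full_name"
evolved[3] = "customer_segment"

result = read(old_file_rows, evolved)
```

The old file still resolves correctly after the rename because the reader looks up field 2, whatever it is currently called, and the new column simply reads as null for rows written before it existed.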
Hidden Partitioning
Hidden partitioning means Iceberg derives partition values automatically from a source column by applying the transforms in a partition spec (e.g., bucket, truncate, or year/month/day). Unlike Hive, users do not need to know the partition layout or filter on a separate partition column; they filter on the source column and Iceberg applies the matching partition filter. Engineers “configure” or “define” a partition spec.
Example: “With hidden partitioning on the event_timestamp column, queries automatically benefit from partition pruning without any special syntax.”
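As a rough illustration of a day transform, the Python sketch below derives the partition value from event_timestamp at write time and again at query time, so the user only ever mentions the source column. The function names, the in-memory files_by_day index, and the file names are hypothetical.

```python
from datetime import date, datetime
from typing import Dict, List

EPOCH = date(1970, 1, 1)


def day_transform(event_timestamp: datetime) -> int:
    """Partition value for a day transform: whole days since 1970-01-01."""
    return (event_timestamp.date() - EPOCH).days


# Partition value -> data files; the writer fills this in, not the user.
files_by_day: Dict[int, List[str]] = {}


def write(event_timestamp: datetime, path: str) -> None:
    files_by_day.setdefault(day_transform(event_timestamp), []).append(path)


def files_for_range(start: datetime, end: datetime) -> List[str]:
    """Turn a filter on the source column into a filter on partition values."""
    low, high = day_transform(start), day_transform(end)
    return sorted(
        path
        for day, paths in files_by_day.items()
        if low <= day <= high
        for path in paths
    )


write(datetime(2026, 6, 1, 9, 0), "a.parquet")
write(datetime(2026, 6, 2, 9, 0), "b.parquet")
write(datetime(2026, 6, 3, 9, 0), "c.parquet")

# The query filters on event_timestamp only; pruning happens behind the scenes.
matched = files_for_range(datetime(2026, 6, 2, 0, 0), datetime(2026, 6, 2, 23, 59))
```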
Copy-on-Write vs Merge-on-Read
These are two strategies for handling row-level updates and deletes. Copy-on-write rewrites every affected data file on each update (fast reads, slow writes). Merge-on-read writes small delete files that record which rows were removed, leaves the existing data files in place, and merges the two at read time (fast writes, slower reads). Engineers “choose,” “configure,” or “benchmark” these write modes.
Example: “For our high-frequency update workload we chose merge-on-read, but for the reporting layer we use copy-on-write for faster queries.”
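The trade-off can be sketched with a toy delete in Python. This is a simplified model, assuming a data file is a list of rows and a delete file is a set of deleted IDs; the function names are hypothetical, and real Iceberg delete files are more involved.

```python
from typing import Dict, FrozenSet, List

Row = Dict[str, object]


def copy_on_write_delete(data_file: List[Row], deleted_id: int) -> List[Row]:
    """Pay at write time: produce a rewritten file without the deleted row."""
    return [row for row in data_file if row["id"] != deleted_id]


def merge_on_read_delete(delete_file: FrozenSet[int], deleted_id: int) -> FrozenSet[int]:
    """Cheap write: only record the deleted ID; the data file is untouched."""
    return delete_file | {deleted_id}


def merge_on_read_scan(data_file: List[Row], delete_file: FrozenSet[int]) -> List[Row]:
    """Pay at read time: filter out deleted rows while scanning."""
    return [row for row in data_file if row["id"] not in delete_file]


data_file = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 30}]

rewritten_file = copy_on_write_delete(data_file, 2)        # new 2-row file
delete_file = merge_on_read_delete(frozenset(), 2)         # tiny delete file
merged_view = merge_on_read_scan(data_file, delete_file)   # same rows, merged on read
```

Both modes return the same rows to the reader; they differ only in whether the work is done by the writer or by every subsequent scan.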
Catalog Integration
An Iceberg catalog tracks table locations and metadata. Supported catalogs include REST, Hive Metastore, AWS Glue, and Nessie. Engineers “configure,” “integrate with,” or “register tables in” a catalog.
Example: “We integrated the Iceberg REST catalog so both Spark and Trino can discover and query the same tables consistently.”
Common Phrases and Collocations
“run a time travel query”
The standard phrase for querying historical data. Always “run” — not “execute a time travel” or “do time travel.”
Example: “Run a time travel query using FOR SYSTEM_TIME AS OF TIMESTAMP '2026-06-01' to see the table state before last week’s migration.”
“snapshot isolation”
Refers to Iceberg’s guarantee that each query sees a consistent snapshot of the table, preventing dirty reads. Used in discussions about concurrency and data consistency.
Example: “Snapshot isolation means that a long-running analytical query won’t see partial results from a concurrent write.”
“evolve the schema”
The preferred phrasing for making schema changes in Iceberg. “Evolve” implies a backward-compatible change; use it when adding or renaming columns.
Example: “We can safely evolve the schema to add the new field; existing queries that don’t reference it won’t be affected.”
“expire snapshots”
Old snapshots accumulate over time. Teams “expire snapshots” as a maintenance operation to free up storage.
Example: “Schedule a weekly job to expire snapshots older than 30 days to keep storage costs under control.”
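What such a maintenance job decides can be sketched in Python. This is a toy model, assuming each snapshot lists the files it references; the expire_snapshots function here is hypothetical and is not the API of any engine. The key point is that a file becomes removable only when no retained snapshot still references it.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set, Tuple

SnapshotInfo = Dict[str, object]


def expire_snapshots(
    snapshots: List[SnapshotInfo], older_than: datetime, current_id: int
) -> Tuple[List[SnapshotInfo], Set[str]]:
    """Drop snapshots committed before older_than, always keeping the current one."""
    kept = [s for s in snapshots if s["committed_at"] >= older_than or s["id"] == current_id]
    expired = [s for s in snapshots if s not in kept]
    live_files = set().union(*(s["files"] for s in kept))
    # Only files that no retained snapshot references can be deleted.
    removable = set().union(*(s["files"] for s in expired)) - live_files
    return kept, removable


snapshots = [
    {"id": 1, "committed_at": datetime(2026, 4, 1), "files": {"a.parquet", "b.parquet"}},
    {"id": 2, "committed_at": datetime(2026, 5, 20), "files": {"b.parquet", "c.parquet"}},
    {"id": 3, "committed_at": datetime(2026, 6, 10), "files": {"c.parquet", "d.parquet"}},
]

cutoff = datetime(2026, 6, 15) - timedelta(days=30)  # "older than 30 days"
kept, removable = expire_snapshots(snapshots, cutoff, current_id=3)
```

Note that b.parquet survives even though the expired snapshot referenced it, because snapshot 2 still needs it.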
“partition pruning”
The optimization where the query engine skips data files that cannot contain matching rows based on partition metadata.
Example: “Partition pruning reduced the query scan from 500 GB to 3 GB because only two daily partitions matched the filter.”
Practical Sentences to Practice
- “The Iceberg table format gave us schema evolution and time travel without any application changes.”
- “We configured the partition spec to bucket on user_id with 256 buckets to distribute the data evenly.”
- “After expiring old snapshots and running orphan file cleanup, the table’s storage footprint dropped by 40 percent.”
- “The REST catalog allows both our Spark jobs and our Trino cluster to share the same table definitions.”
- “I need to roll back to the snapshot from before the pipeline run — can you give me the snapshot ID?”
Common Mistakes to Avoid
Saying “Iceberg database” instead of “Iceberg table”
Iceberg defines a table format, not a database. The catalog holds tables; Iceberg itself is not a database engine. Say “Iceberg table” or “Iceberg-formatted table.”
Confusing “manifest” and “metadata file”
In Iceberg, metadata has a hierarchy: metadata file → manifest list → manifests → data files. A manifest is not the same as the top-level metadata file. Be specific when discussing which layer you are referring to.
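Walking the hierarchy once can help keep the four layers apart. The Python sketch below models each layer as a plain dictionary; the keys and file names are hypothetical and greatly simplified compared with the real metadata files.

```python
from typing import Dict, List

# Layer 1: table metadata file, pointing each snapshot at its manifest list.
table_metadata = {
    "current-snapshot-id": 2,
    "snapshots": {1: "snap-1.avro", 2: "snap-2.avro"},
}

# Layer 2: each manifest list names the manifests that make up one snapshot.
manifest_lists: Dict[str, List[str]] = {
    "snap-1.avro": ["m1.avro"],
    "snap-2.avro": ["m1.avro", "m2.avro"],
}

# Layer 3: each manifest names a subset of the data files (layer 4).
manifests: Dict[str, List[str]] = {
    "m1.avro": ["data/a.parquet"],
    "m2.avro": ["data/b.parquet", "data/c.parquet"],
}


def current_data_files(metadata: dict, manifest_lists: dict, manifests: dict) -> List[str]:
    """Follow metadata file -> manifest list -> manifests -> data files."""
    manifest_list = metadata["snapshots"][metadata["current-snapshot-id"]]
    files: List[str] = []
    for manifest in manifest_lists[manifest_list]:
        files.extend(manifests[manifest])
    return files


files = current_data_files(table_metadata, manifest_lists, manifests)
```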
Using “partition” as a verb carelessly
Say “partition the table by date” when configuring the partition spec, but avoid “Iceberg partitioned my query” — the correct phrase is “Iceberg pruned partitions during query planning.”
Summary
Apache Iceberg’s vocabulary (snapshots, time travel, schema evolution, hidden partitioning, manifest files, and catalog integration) is the shared language of modern data lakehouse engineering. Using these terms precisely helps you write accurate design documents, communicate clearly in data team standups, and contribute effectively to open-source projects in the Iceberg ecosystem. The best resources for deepening this vocabulary are the official Apache Iceberg documentation and the recorded talks from the Data + AI Summit and Current conferences, where engineers describe real-world Iceberg deployments in natural, professional English.