Apache Parquet Column Encoding: English for Data Engineering Discussions
Learn the English vocabulary data engineers use when discussing Apache Parquet column encodings, row groups, page formats, and storage optimization strategies.
Apache Parquet is the columnar storage format at the heart of most modern data lakes, and teams that work with it regularly need to discuss encoding strategies, compression trade-offs, and file layout optimisation. This guide covers the technical vocabulary data engineers use when tuning Parquet files and discussing storage strategies in code reviews and architecture meetings.
Core Vocabulary
Dictionary encoding: A compression technique that replaces repeated values in a column with small integer IDs that reference a dictionary lookup table. Dictionary encoding is highly effective for low-cardinality columns such as status codes, categories, or country names.
“The status column has only eight distinct values, so dictionary encoding replaces each value with an ID that fits in 3 bits, far smaller than the 4 bytes per row we stored before.”
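The idea can be illustrated with a minimal pure-Python sketch. It is not how any Parquet writer is implemented; the function names `dictionary_encode` and `dictionary_decode` and the sample values are assumptions made for illustration.

```python
def dictionary_encode(values):
    """Replace each value with a small integer ID into a list of distinct values."""
    dictionary = []   # distinct values, in first-seen order
    index_of = {}     # value -> ID
    ids = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        ids.append(index_of[value])
    return dictionary, ids


def dictionary_decode(dictionary, ids):
    """Look each ID up in the dictionary to recover the original values."""
    return [dictionary[i] for i in ids]


statuses = ["active", "active", "pending", "active", "closed", "pending"]
dictionary, ids = dictionary_encode(statuses)
print(dictionary)  # ['active', 'pending', 'closed']
print(ids)         # [0, 0, 1, 0, 2, 1]
```

The column now stores one short dictionary plus a list of small integers, which is what makes the later RLE and bit-packing steps worthwhile.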
RLE (Run-Length Encoding): A compression technique that replaces consecutive repeated values with a single value and a count. In Parquet, RLE is often combined with bit packing to compress the integer IDs produced by dictionary encoding.
“After dictionary encoding maps our status values to IDs 0-7, RLE compresses long runs of the same status — perfect for our time-series data where status rarely changes between consecutive rows.”
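A small sketch of plain run-length encoding follows. Parquet's actual RLE/bit-packing hybrid format is more involved, so treat this only as an illustration of the concept; `rle_encode` and `rle_decode` are hypothetical helper names.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal consecutive values into a (value, count) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


status_ids = [0, 0, 0, 0, 2, 2, 1, 1, 1]
runs = rle_encode(status_ids)
print(runs)  # [(0, 4), (2, 2), (1, 3)]
```

Nine values become three pairs here; the longer the runs, the larger the saving.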
Bit packing: A technique that stores integers using only the minimum number of bits required to represent their range. If a column only ever holds values 0-15, each value needs just 4 bits instead of 32.
“Bit packing the dictionary IDs for a 5-value column means we use 3 bits per row instead of 32 — across a billion rows, that’s a significant saving.”
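The arithmetic behind that sentence can be checked with a short sketch. The bit layout below (first value in the lowest bits) is an assumption chosen for simplicity, and `bit_width`, `pack`, and `unpack` are hypothetical names rather than any library's API.

```python
def bit_width(max_value):
    """Minimum number of bits needed to represent values in 0..max_value."""
    return max_value.bit_length()


def pack(values, width):
    """Pack non-negative integers into bytes using `width` bits per value."""
    packed = 0
    for i, value in enumerate(values):
        if value < 0 or value >> width:
            raise ValueError(f"{value} does not fit in {width} bits")
        packed |= value << (i * width)
    n_bytes = (len(values) * width + 7) // 8
    return packed.to_bytes(n_bytes, "little")


def unpack(data, width, count):
    """Recover `count` integers of `width` bits each from packed bytes."""
    packed = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]


ids = [0, 4, 2, 1, 3, 4, 0, 2]   # dictionary IDs for a 5-value column
width = bit_width(max(ids))      # 3 bits cover IDs 0..4
data = pack(ids, width)
print(width, len(data))          # 3 3  (8 values * 3 bits = 24 bits = 3 bytes)
```

Stored as 32-bit integers, the same eight IDs would take 32 bytes.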
Delta encoding: An encoding that stores the difference between consecutive values rather than the values themselves. Delta encoding is effective for monotonically increasing columns such as timestamps, auto-incremented IDs, or sequential event counters.
“We applied delta encoding to the event_id column. Since IDs increment by 1, the deltas are almost all 1, and the writer then bit-packs each block of deltas into almost no space.”
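A minimal sketch of the core idea is shown below. Parquet's DELTA_BINARY_PACKED encoding additionally works in blocks and bit-packs the deltas, which this sketch leaves out; the helper names and sample timestamps are assumptions.

```python
def delta_encode(values):
    """Return the first value plus the differences between consecutive values."""
    first = values[0]  # assumes a non-empty sequence
    deltas = [b - a for a, b in zip(values, values[1:])]
    return first, deltas


def delta_decode(first, deltas):
    """Rebuild the original values by accumulating the deltas."""
    values = [first]
    for delta in deltas:
        values.append(values[-1] + delta)
    return values


timestamps = [1700000000, 1700000001, 1700000003, 1700000004, 1700000004]
first, deltas = delta_encode(timestamps)
print(first, deltas)  # 1700000000 [1, 2, 1, 0]
```

The large absolute values are replaced by small deltas that need only a couple of bits each.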
Byte stream split: An encoding for floating-point values that splits each value into its individual bytes and regroups them by position: all first bytes together, all second bytes together, and so on. The encoding does not shrink the data by itself; it makes each byte stream more uniform so that the compression codec applied afterwards achieves a better ratio.
“Our sensor readings are IEEE 754 floats, and byte stream split improved our Zstd compression ratio by 30% because the exponent bytes now compress together.”
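The regrouping can be sketched in a few lines for 32-bit floats. This is an illustration under the assumption of little-endian float32 values; `byte_stream_split` and `byte_stream_join` are hypothetical names, not a library API.

```python
import struct

FLOAT_WIDTH = 4  # bytes in an IEEE 754 single-precision float


def byte_stream_split(values):
    """Regroup float32 bytes by position: all byte 0s, then all byte 1s, and so on."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return b"".join(raw[k::FLOAT_WIDTH] for k in range(FLOAT_WIDTH))


def byte_stream_join(data, count):
    """Inverse of byte_stream_split: re-interleave the streams and unpack the floats."""
    streams = [data[k * count:(k + 1) * count] for k in range(FLOAT_WIDTH)]
    raw = bytes(streams[k][i] for i in range(count) for k in range(FLOAT_WIDTH))
    return list(struct.unpack(f"<{count}f", raw))


readings = [20.5, 20.75, 21.0, 21.25]
encoded = byte_stream_split(readings)
print(encoded[12:])  # b'AAAA': the sign/exponent bytes of similar readings are identical
```

The output is the same size as the input. The benefit appears only when a codec such as Zstd compresses the now-repetitive streams.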
Page: The smallest addressable unit within a Parquet file. A page belongs to a column chunk and contains either data values, a dictionary, or an index. Pages are the level at which compression and encoding are applied.
“The query skipped three pages entirely using the page index — it didn’t need to decompress and decode them because the min/max statistics showed the target value wasn’t in those pages.”
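The skipping logic in that sentence can be sketched as follows. The `PageStats` class and its values are made up for illustration; real readers take the equivalent min/max values from the file's page index.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page min/max statistics for one column."""
    min_value: int
    max_value: int


def pages_to_read(page_stats, target):
    """Return indexes of pages whose [min, max] range could contain the target."""
    return [
        i for i, stats in enumerate(page_stats)
        if stats.min_value <= target <= stats.max_value
    ]


stats = [PageStats(0, 99), PageStats(100, 199), PageStats(200, 299), PageStats(300, 399)]
print(pages_to_read(stats, 250))  # [2]
```

Three of the four pages are ruled out from metadata alone, so they are never decompressed or decoded.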
Row group: A horizontal partition of rows in a Parquet file. A row group contains one column chunk per column. Row group size determines the granularity of statistics-based pruning during query execution.
“We increased the row group size from 128 MB to 512 MB — larger row groups mean better compression but more memory needed during write.”
Column chunk: All the encoded and compressed data for a single column within a single row group. Column chunks are the unit read when a query selects specific columns, enabling column pruning.
“A query that reads only the timestamp and user_id columns touches just two column chunks per row group, skipping all the other columns entirely.”
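The relationship between row groups, column chunks, and column pruning can be modelled with plain Python structures. This is a conceptual sketch only: it ignores encoding, compression, and the on-disk layout, and the function names and sample rows are assumptions.

```python
def to_row_groups(rows, columns, rows_per_group):
    """Split rows into row groups; each group holds one column chunk per column."""
    row_groups = []
    for start in range(0, len(rows), rows_per_group):
        batch = rows[start:start + rows_per_group]
        row_groups.append({col: [row[col] for row in batch] for col in columns})
    return row_groups


def read_columns(row_groups, wanted):
    """Column pruning: touch only the requested column chunks in each row group."""
    return [{col: group[col] for col in wanted} for group in row_groups]


rows = [
    {"timestamp": 1, "user_id": "a", "payload": "x" * 10},
    {"timestamp": 2, "user_id": "b", "payload": "y" * 10},
    {"timestamp": 3, "user_id": "a", "payload": "z" * 10},
]
row_groups = to_row_groups(rows, ["timestamp", "user_id", "payload"], rows_per_group=2)
selected = read_columns(row_groups, ["timestamp", "user_id"])
print(len(row_groups), selected[1])  # 2 {'timestamp': [3], 'user_id': ['a']}
```

The `payload` chunks are never accessed, which is the saving a columnar layout offers over row-oriented storage.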
Bloom filter: A probabilistic data structure stored in Parquet files that allows readers to quickly determine whether a value is definitely not present in a row group, without reading and decompressing the full column chunk.
“Min/max statistics cannot prune randomly distributed IDs, so we added bloom filters to the user_id column. Point lookups for a specific user now skip 99% of row groups without reading the column chunk.”
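The "definitely not present" guarantee can be seen in a classic textbook bloom filter. Parquet itself uses a different variant (a split block bloom filter with its own hash function), so the class below, its parameters, and its use of SHA-256 are illustrative assumptions only.

```python
import hashlib


class BloomFilter:
    """Textbook bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive several bit positions per value by salting the hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        """False means definitely absent; True means possibly present."""
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


bloom = BloomFilter()
for user_id in ("u-1001", "u-1002", "u-1003"):
    bloom.add(user_id)

print(bloom.might_contain("u-1002"))  # True: added values are never reported absent
```

A reader that gets `False` for a lookup key can skip the row group outright; a `True` still requires reading the data, because false positives are possible.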
Key Collocations
- apply dictionary encoding — “Apply dictionary encoding to any string column with fewer than 100,000 distinct values — above that threshold, the dictionary overhead outweighs the benefit.”
- enable bloom filters — “Enable bloom filters on the primary key columns if your workload involves frequent single-row lookups by ID.”
- tune row group size — “We tuned row group size to match the memory available to the writer jobs — too large and the writers OOM, too small and readers lose statistics benefit.”
- read a column chunk — “Parquet readers read a column chunk at a time, not row by row — your query plan should select only the columns you need.”
- skip pages via predicate pushdown — “The engine uses page-level min/max statistics to skip pages via predicate pushdown, so a WHERE clause on a sorted column avoids most I/O.”
- compress with Snappy/Zstd — “We compress with Zstd at level 3 — better ratio than Snappy with acceptable decompression speed for our analytical workloads.”
Using This Vocabulary in Practice
When discussing Parquet optimisation, the phrases “at the page level” and “at the row group level” appear constantly. These phrases locate the scope of a setting or problem: “Bloom filters are stored per column chunk, so they prune at the row group level, while min/max statistics exist at the row group level and, through the page index, at the page level. Make sure you understand which one is being applied.” Precision about level saves a lot of confusion in debugging sessions.
The phrase “statistics-based pruning” is the general term for the family of optimisations that use min/max values to skip data without reading it; bloom filter checks are usually discussed alongside it. You will often hear it used together with “predicate pushdown”, which names the underlying mechanism: the query engine pushes filter conditions down into the file reader so that unnecessary data is never loaded.
When comparing encoding strategies, engineers use the pattern: “X is effective for Y-type columns because of Z.” For example: “Delta encoding is effective for timestamp columns because consecutive timestamps are close in value, making the deltas small and highly compressible.”
Practice Tip
Open a Parquet file using a tool like DuckDB or PyArrow and inspect its metadata — specifically the encoding type, compression codec, and row group count. Describe what you see in English, using at least four terms from this article. This exercise connects abstract vocabulary to concrete observable file properties, which is exactly the context in which you will use these words in real team discussions.