Apache Parquet's columnar storage uses sophisticated encoding schemes to achieve high compression and query performance. Master dictionary encoding, RLE, delta encoding, compression codecs, row groups, and page footers for building efficient data lake architectures.
0 / 5 completed
1 / 5
A Parquet file stores a column with many repeated values (e.g., a status column with only 3 distinct values). Which encoding is most efficient for this pattern?
Dictionary encoding is ideal for low-cardinality columns. Parquet builds a dictionary of distinct values and stores integer indices in the data pages. For a column with 3 distinct values across millions of rows, the dictionary fits in a few bytes, and each value requires only a 2-bit index — achieving extreme compression for repeated string or binary values.
2 / 5
Parquet's RLE_DICTIONARY encoding combines two techniques. What are they?
RLE_DICTIONARY applies Run-Length Encoding (RLE) on top of dictionary-encoded indices. When consecutive rows have the same dictionary index (same value), RLE stores the index once with a repeat count instead of repeating it. This compound encoding is Parquet's default for most string/binary columns with repetition.
3 / 5
A Parquet file stores a sorted integer column with values like [1000, 1005, 1010, 1015]. Which encoding exploits this pattern?
DELTA_BINARY_PACKED stores the first value, then the deltas (differences) between consecutive values, packed using bit-widths sized to the actual delta range. For regularly spaced integers (delta = 5 each time), deltas are tiny and pack into very few bits — far more efficient than storing the full integer values with PLAIN encoding.
4 / 5
A Parquet row group has a target size of 128 MB. What is the primary purpose of dividing a Parquet file into row groups?
A row group is a horizontal partition of rows within a Parquet file. Each row group stores column statistics (min/max values, null counts) in its metadata. Query engines use these statistics for predicate pushdown — skipping entire row groups that cannot satisfy a WHERE clause. Row groups also align with HDFS/S3 block sizes for parallel I/O.
5 / 5
A Parquet file is written with compression=SNAPPY at the page level. At what level does Parquet apply compression codecs?
Parquet applies compression codecs at the page level — each data page (typically 1 MB of encoded column data) is compressed independently. This enables efficient random access: a reader can decompress only the pages needed to satisfy a query, without decompressing the entire column chunk or file.