English Vocabulary for ClickHouse Analytics

Learn the professional English vocabulary for ClickHouse — MergeTree, ARRAY JOIN, argMax, quantile(), materialized views, ReplicatedMergeTree, and the Kafka engine in real analytics team discussions.

ClickHouse is a columnar OLAP database built for extreme analytical query performance — capable of scanning billions of rows per second on a single server. It is the database of choice for teams running real-time analytics pipelines, product metrics systems, and event tracking at scale. ClickHouse has a rich vocabulary of its own: table engines, aggregate functions, and SQL extensions that do not exist in standard databases. If you work in a data or analytics engineering team that uses ClickHouse, mastering this vocabulary is critical for writing efficient queries, designing schemas, and participating in performance discussions. This post covers the most important ClickHouse terms.

Key Vocabulary

MergeTree The most fundamental and widely used ClickHouse table engine. MergeTree tables store data sorted by a primary key and periodically merge data parts in the background, which enables very fast range scans and aggregations. Almost all production ClickHouse tables use MergeTree or a variant of it. Example: “Use MergeTree() for the events table with ORDER BY (user_id, event_time) — sorting by user then timestamp lets us scan a single user’s history with almost no I/O.”

ARRAY JOIN A ClickHouse SQL extension that unnests array columns into individual rows, similar to UNNEST in PostgreSQL. It is essential for working with event data that stores multiple values (tags, properties, items) in array fields. Example: “Use ARRAY JOIN to explode the tags array — we need one row per tag to count how many events have each tag applied.”

argMax An aggregate function that returns the value of one column for the row where another column is at its maximum. It is extremely useful for getting the “latest value” of a field without a subquery — for example, finding each user’s most recent profile state. Example: “Use argMax(status, updated_at) to get each user’s current status from the events table — it finds the status value from whichever row has the largest updated_at.”

quantile() An aggregate function for computing approximate percentiles across large datasets. quantile(0.95)(response_time_ms) computes the 95th percentile response time. ClickHouse uses sampling for speed, making it practical on billions of rows. Example: “Add quantile(0.99)(latency_ms) AS p99_latency to the monitoring query — we care more about the 99th percentile than the average for SLA tracking.”

toStartOfInterval() A date/time function that truncates a timestamp to the start of a time bucket — hour, day, week, or custom intervals. Essential for time-series aggregations and dashboards. Example: “Group by toStartOfInterval(event_time, INTERVAL 5 MINUTE) to get a time-series with 5-minute buckets — it’s more flexible than toStartOfHour() when you need custom granularity.”

Materialized view A ClickHouse object that automatically pre-aggregates data as it is inserted, writing results to a target table. Unlike in other databases, ClickHouse materialized views are trigger-based — they run on every INSERT — making them suitable for maintaining real-time aggregates. Example: “Create a materialized view that pre-aggregates event counts by user and day as data lands — the dashboard query will hit the aggregate table instead of scanning the raw events.”

ReplicatedMergeTree A variant of MergeTree that replicates data across multiple ClickHouse nodes using Apache ZooKeeper or ClickHouse Keeper for coordination. Required for high availability and used in production clusters where data loss is not acceptable. Example: “Upgrade the events table to ReplicatedMergeTree before launch — the single-node MergeTree is fine for development but we need replication for production durability.”

Kafka engine A ClickHouse table engine that reads data directly from a Kafka topic. You create a Kafka engine table, then pair it with a materialized view that writes the consumed messages to a MergeTree table, creating a streaming ingestion pipeline entirely within ClickHouse. Example: “Set up a Kafka engine table pointing to the page_views topic and a materialized view to land the data into MergeTree — we don’t need a separate Flink job for this ingestion.”

How to Use This Vocabulary

ClickHouse schema design discussions revolve around the choice of table engine, the ORDER BY key (which determines the physical sort order and query performance), and whether to use materialized views for pre-aggregation. A well-formed schema proposal sounds like: “I’ll use ReplicatedMergeTree with ORDER BY (tenant_id, event_type, event_time) and a materialized view to pre-aggregate daily counts per tenant.”

Query optimization discussions frequently mention argMax as a pattern for “latest record” lookups, quantile() for percentile metrics, and ARRAY JOIN for unnesting. Understanding why these exist — and what alternatives they replace — lets you contribute meaningfully to query reviews.

Example Conversation

Nina: The dashboard query is scanning 200 billion rows on every load. It’s taking 12 seconds. Marco: Do we have a materialized view for daily aggregates? The dashboard doesn’t need raw event rows. Nina: Not yet. I’ll create one using toStartOfInterval() bucketed by day and argMax for the latest status per user. Marco: Good. Make sure the target table is ReplicatedMergeTree — we’re on the production cluster.

Practice

  1. Write a ClickHouse query that computes the 95th and 99th percentile of a response_time_ms column grouped by hour using toStartOfInterval() and quantile(). Read it aloud using the vocabulary from this post.
  2. Explain to a SQL-fluent colleague who is new to ClickHouse what ARRAY JOIN does and how it differs from a standard SQL JOIN. Give a concrete example involving an array of product tags.
  3. Describe a materialized view strategy for a high-volume events table: what data it pre-aggregates, what the target table engine should be, and why this is faster than querying raw data for dashboards.