dbt Snapshots and SCD2: English Vocabulary for Data Engineers
Learn English vocabulary for dbt snapshots and slowly changing dimensions to speak confidently in data engineering standups and design reviews.
Data engineers working with dbt often find themselves in standups, design reviews, and retrospectives where precise English vocabulary matters as much as technical knowledge. Terms like “SCD type 2,” “grain,” and “invalidated rows” have specific meanings that native English speakers use in exact ways — and misusing them can cause real confusion. This post teaches you the language patterns around dbt snapshots and slowly changing dimensions so you can follow discussions, ask good questions, and explain your decisions clearly.
Key Vocabulary
Slowly Changing Dimension (SCD) — A dimension table in a data warehouse where values change over time and you need to track history. Engineers distinguish between SCD “type 1” (overwrite), “type 2” (add a new row), and others. You’ll hear “we’re implementing SCD2 here” or “this is modeled as a type-2 dimension.”
“The customer address table needs to be SCD2 — we can’t just overwrite when someone moves, because finance needs historical shipping addresses for their reports.”
Snapshot materialization — The dbt-specific mechanism for capturing point-in-time records from a source table. Engineers say a model “uses snapshot materialization” or refer to “our snapshots” as a category of dbt models.
“We’ve got a snapshot materialization on the orders table — it runs every hour and captures any changes to status or amount.”
Unique key — The column or combination of columns that identifies a single record in a snapshot. In dbt, you configure this in the snapshot block. Native speakers say “the unique key for this snapshot is” or ask “what’s the grain of this snapshot?”
“We’re using
order_idas the unique key, but be careful — there are some legacy duplicate IDs in the source system that we need to deduplicate upstream.”
Check strategy — A dbt snapshot strategy that detects changes by comparing specified columns. Engineers say “we’re using the check strategy” or “the check strategy watches these columns for changes.”
“We went with the check strategy here and listed out the five columns we actually care about — otherwise a noisy audit column was triggering false-positive changes every hour.”
Timestamp strategy — A dbt snapshot strategy that detects changes by comparing an updated_at timestamp column. You’ll hear “the timestamp strategy relies on a reliable updated_at” or “we can’t use timestamp strategy because this source doesn’t update that field.”
“The timestamp strategy is simpler to reason about, but it falls apart if the source system backfills records without updating the timestamp.”
Invalidated rows — In SCD2, when a record changes, the old version is “invalidated” — its dbt_valid_to column is set to the current timestamp. Engineers say rows “get invalidated” or refer to “the invalidation logic.”
“One thing that tripped us up: when a record is deleted in the source, dbt doesn’t invalidate the snapshot row by default — you have to handle hard deletes separately.”
Hard deletes — Records that are physically deleted from a source system, not just marked as inactive. This is a common problem with snapshots. Engineers say “we need to handle hard deletes” or “soft deletes vs. hard deletes.”
“The CRM team confirmed they do hard deletes on deactivated accounts, so we need the
invalidate_hard_deletesconfig or we’ll have zombie records in our snapshot.”
Grain — The level of detail that one row in a table represents. Engineers ask “what’s the grain of this table?” to understand what one record means. It’s a fundamental data modeling concept.
“Before we build this snapshot, let’s align on the grain — is it one row per order, or one row per order line item?”
Surrogate key — A system-generated identifier (usually a hash) used instead of a natural business key, especially in SCD2 where the same business entity has multiple rows. Engineers say “we add a surrogate key” or “the surrogate key is a hash of the unique key plus the snapshot timestamp.”
“We’re generating the surrogate key with
dbt_utils.generate_surrogate_key— downstream models join on that so they always get the right version.”
Effective dating — The practice of tracking when a record version became active and when it was superseded, using dbt_valid_from and dbt_valid_to columns. You’ll hear “effective dating,” “point-in-time joins,” or “as-of queries.”
“The finance team wants to reconstruct what the customer record looked like at invoice date — that’s a classic effective dating use case, which is exactly what our snapshot handles.”
Phrases in Context
Explaining a design choice in a standup:
“We went with the check strategy rather than timestamp because the source system’s
updated_atisn’t reliable — it doesn’t get bumped on certain batch updates. So we’re explicitly watching the three columns that matter: status, amount, and assigned_rep.”
Flagging a hard delete problem in a review:
“One thing to call out in this snapshot: if a record gets hard deleted in Salesforce, we won’t know about it. The row just stays open in our snapshot with no
dbt_valid_to. Do we have a process to reconcile that, or should we addinvalidate_hard_deletes: true?”
Clarifying grain during a design discussion:
“Before we snapshot this table, I want to make sure we’re aligned on grain. Is this one row per subscription, or one per subscription per billing cycle? The answer changes the unique key configuration significantly.”
Documenting a decision:
“We’re using SCD type 2 here because the analytics team needs to join on historical customer segments — overwriting would lose that history and break their quarter-over-quarter cohort analysis.”
Discussing performance in a retrospective:
“The snapshot job is getting slower as the table grows — it’s doing a full scan every run to detect changes. We might want to look at incremental strategies or partition the source before we snapshot it.”
Key Collocations
Native speakers combine these words in specific ways. Learn them as fixed phrases:
- implement SCD2 (not “create SCD2” or “do SCD2”)
- capture changes / detect changes (snapshots “capture” or “detect” changes)
- open record — a snapshot row with no
dbt_valid_to(still active): “the open record for this customer” - close a record — set
dbt_valid_to, marking a version as superseded - point-in-time join — joining to get the value as of a specific date
- backfill records — reprocess historical data: “if they backfill records without updating the timestamp…”
- downstream models — dbt models that depend on the snapshot: “downstream models join on the surrogate key”
- trigger a snapshot run / the snapshot picks up changes
Practice
Go to the dbt documentation page on snapshots and read the “How dbt Snapshots Work” section. For each of the ten vocabulary terms in this post, find where dbt uses that concept (even if with slightly different wording). Then write a three-paragraph explanation — as if you are explaining the snapshot configuration to a new team member in a Slack message — using at least six terms from this post. Writing in Slack format (slightly informal, concise) is different from writing documentation, and practicing both styles will help you in real work situations.