English for Data Platform Architects: Lakehouse, Medallion, and Data Contracts

Master the English vocabulary and natural phrases for modern data platform design reviews: lakehouse, medallion architecture, data contracts, and lineage.

Modern data platform architecture has its own dialect. Terms like “medallion,” “data contract,” and “lakehouse” aren’t just buzzwords — they carry precise meaning, and using them correctly signals architectural fluency. This post focuses on the English vocabulary and natural phrases you need for data platform design reviews, architecture discussions, and cross-team alignment meetings.

Core Vocabulary

Data Lakehouse

A data lakehouse is an architecture that combines the flexibility and low cost of a data lake (raw files on object storage) with the structure and query performance of a data warehouse (schema enforcement, ACID transactions, indexing).

“We migrated from a pure data lake to a lakehouse architecture. Now we get ACID transactions and time travel on our S3 data without moving everything into Redshift.”

Key phrases: adopt a lakehouse architecture, the lakehouse layer, unify storage and compute, bring warehouse capabilities to the lake.

The word “unify” appears frequently in this domain: engineers speak about unifying storage, unifying the batch and streaming paths, and unifying access patterns.

Medallion Architecture

Medallion architecture organizes data into three progressive layers named after precious metals: bronze, silver, and gold.

  • Bronze layer: raw, unvalidated data ingested directly from sources. Often called the landing zone or raw layer.
  • Silver layer: cleansed, conformed, and deduplicated data. Joins happen here. This is where most transformation logic lives.
  • Gold layer: business-ready, aggregated data optimized for consumption — dashboards, ML features, reporting.

“The analytics team is querying the gold layer directly. If they’re seeing data quality issues, the problem is upstream — check the silver layer transformations.”

“We ingest everything raw into bronze and never modify it. Bronze is append-only and serves as our audit log.”

Useful phrases: promote data through the layers, the data lands in bronze, cleanse at the silver layer, serve from gold.
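
To make “promote data through the layers” concrete, here is a minimal sketch in plain Python. The records, field names, and the `to_silver` and `to_gold` helpers are all hypothetical; a real platform would run this logic in Spark, SQL, or a similar engine, not on in-memory lists.

```python
from collections import defaultdict

# Hypothetical raw events as they might land in bronze (never modified afterwards).
bronze = [
    {"order_id": "A1", "customer": " alice ", "amount": "20.00"},
    {"order_id": "A1", "customer": " alice ", "amount": "20.00"},  # duplicate delivery
    {"order_id": "B7", "customer": "Bob", "amount": "5.50"},
    {"order_id": "C3", "customer": "alice", "amount": "not-a-number"},  # bad record
]


def to_silver(rows):
    """Cleanse, conform, and deduplicate bronze rows."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows that fail validation
        if row["order_id"] in seen:
            continue  # deduplicate on the business key
        seen.add(row["order_id"])
        silver.append({
            "order_id": row["order_id"],
            "customer": row["customer"].strip().lower(),  # conform names
            "amount": amount,
        })
    return silver


def to_gold(rows):
    """Aggregate silver rows into a business-ready total per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'alice': 20.0, 'bob': 5.5}
```

Note that `bronze` is left untouched: each promotion step builds a new collection, which mirrors the append-only convention described in the quote above.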

Data Contract

A data contract is a formal, versioned agreement between a data producer and a data consumer that specifies the schema, semantics, quality guarantees, and SLAs of a dataset.

“The payments team broke our pipeline last Tuesday because they renamed a column without updating the data contract. We’re now requiring contract validation in their CI pipeline.”

“Before we onboard a new data source, we ask the producer to sign off on a data contract. It specifies the delivery cadence, the schema, and the acceptable null rate for each field.”

Verbs: define a contract, sign off on a contract, break a contract (when a producer violates it), validate against the contract, version the contract.

The phrase “sign off on” is important: it implies formal approval, not just awareness. You’ll use it in cross-team conversations: “Has the upstream team signed off on the schema changes?”
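
When someone says “validate against the contract,” they usually mean an automated check along these lines. This is only a sketch under stated assumptions: the `CONTRACT` structure, the `FieldRule` class, the field names, and the `validate` function are invented for illustration and do not follow any particular data contract specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRule:
    dtype: type
    max_null_rate: float  # acceptable share of nulls, from 0.0 to 1.0


# Hypothetical contract: the version and field names are illustrative only.
CONTRACT = {
    "version": "1.0.0",
    "fields": {
        "payment_id": FieldRule(str, 0.0),
        "amount": FieldRule(float, 0.0),
        "coupon_code": FieldRule(str, 0.5),
    },
}


def validate(rows, contract):
    """Return a list of human-readable contract violations (empty if none)."""
    violations = []
    for name, rule in contract["fields"].items():
        missing = [r for r in rows if name not in r]
        if missing:
            violations.append(f"{name}: column missing in {len(missing)} row(s)")
            continue
        values = [r[name] for r in rows]
        null_rate = sum(v is None for v in values) / len(rows) if rows else 0.0
        if null_rate > rule.max_null_rate:
            violations.append(
                f"{name}: null rate {null_rate:.2f} exceeds {rule.max_null_rate:.2f}"
            )
        if any(v is not None and not isinstance(v, rule.dtype) for v in values):
            violations.append(f"{name}: expected {rule.dtype.__name__}")
    return violations


# A producer renames payment_id to id without updating the contract.
renamed = [{"id": "p1", "amount": 9.99, "coupon_code": "SPRING"}]
print(validate(renamed, CONTRACT))  # ['payment_id: column missing in 1 row(s)']
```

Running a check like this in the producer’s CI pipeline and failing the build on any violation is one way to catch a renamed column before it reaches consumers.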

Schema Registry

A schema registry is a centralized service that stores and versions data schemas — typically for streaming pipelines using Avro, Protobuf, or JSON Schema. It enforces schema compatibility between producers and consumers.

“We enforce backward compatibility in the schema registry. A producer can add optional fields but cannot remove or rename existing ones without a major version bump.”

Phrases: register a schema, check compatibility, backward/forward compatible, schema evolution, the registry rejects the incompatible schema.
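
The policy in the quote above (fields may be added, but existing fields may not be removed or renamed) can be sketched as a simple comparison. The `breaking_changes` function and the schemas below are hypothetical. Real registries define their compatibility modes more precisely, and the exact rules differ by serialization format, so treat this as an illustration of the idea, not of any registry’s behaviour.

```python
def breaking_changes(old_schema, new_schema):
    """Compare two {field_name: type_name} mappings.

    Returns the reasons the new schema would be rejected under the
    simplified policy: existing fields must keep their name and type.
    """
    problems = []
    for name, old_type in old_schema.items():
        if name not in new_schema:
            problems.append(f"field '{name}' was removed or renamed")
        elif new_schema[name] != old_type:
            problems.append(
                f"field '{name}' changed type from {old_type} to {new_schema[name]}"
            )
    return problems


# Hypothetical schema versions.
v1 = {"user_id": "string", "amount": "double"}
v2_added = {"user_id": "string", "amount": "double", "currency": "string"}
v2_renamed = {"uid": "string", "amount": "double"}

print(breaking_changes(v1, v2_added))    # []
print(breaking_changes(v1, v2_renamed))  # ["field 'user_id' was removed or renamed"]
```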

Data Mesh

Data mesh is an organizational and architectural approach where data ownership is decentralized — each domain team owns and operates its own data products, rather than a central data team owning everything.

“We’re moving toward a data mesh model. The payments team will own and serve their own data products. The central platform team provides the infrastructure — the tools, the governance standards, the schema registry.”

Important distinction for conversations: data mesh is an organizational pattern, not a specific technology. Engineers sometimes blur this. In design reviews: “That’s a data mesh principle, not a lakehouse feature — the two are complementary but separate concepts.”

Data Product

In the data mesh context, a data product is a dataset or API treated with product thinking — it has an owner, a defined interface (schema + SLA), documentation, quality metrics, and a versioning policy.

“The recommendation team’s feature store is a first-class data product. It has a stated freshness SLA of 15 minutes, documented field semantics, and an owner who is on call for it.”

Phrases: treat data as a product, the data product owner, publish a data product, consume a data product, data product SLA.

Data Quality SLA

A data quality SLA (Service Level Agreement) defines measurable commitments about the quality of a dataset — completeness, freshness, accuracy, and null rates.

“Our gold layer has a data quality SLA: 99.5% completeness on the user_id field, data freshness within 2 hours of source, and zero duplicate primary keys. Any violation triggers an alert and blocks downstream jobs.”

In conversations: meet the SLA, breach the SLA, define quality thresholds, monitor SLA compliance.
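
The three commitments in the example quote (completeness, freshness, no duplicate keys) translate into checks like the following sketch. The `check_sla` function, the `id` and `loaded_at` column names, and the sample rows are assumptions made for illustration. Freshness is approximated here as the age of the newest `loaded_at` timestamp, which is a simplification of “within 2 hours of source.”

```python
from datetime import datetime, timedelta, timezone


def check_sla(rows, now, key="id", min_completeness=0.995,
              max_staleness=timedelta(hours=2)):
    """Return the names of the SLA checks that a batch of rows breaches."""
    if not rows:
        return ["empty batch"]
    breaches = []
    # Completeness: share of rows with a non-null user_id.
    filled = sum(r.get("user_id") is not None for r in rows)
    if filled / len(rows) < min_completeness:
        breaches.append("completeness")
    # Freshness: the newest row must be recent enough.
    newest = max(r["loaded_at"] for r in rows)
    if now - newest > max_staleness:
        breaches.append("freshness")
    # Uniqueness: no duplicate primary keys.
    keys = [r[key] for r in rows]
    if len(keys) != len(set(keys)):
        breaches.append("duplicate keys")
    return breaches


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"id": 1, "user_id": "u1", "loaded_at": now - timedelta(minutes=30)},
    {"id": 2, "user_id": None, "loaded_at": now - timedelta(hours=1)},
]
print(check_sla(rows, now))  # ['completeness']
```

In a pipeline, a non-empty result would be the point at which you “breach the SLA”: raise an alert and stop downstream jobs from running.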

Lineage Tracking

Lineage tracking records the origin and transformation history of data — which source it came from, which pipelines transformed it, and which consumers depend on it.

“When the finance team found anomalies in the monthly report, lineage tracking let us trace the data back to a broken ETL job that ran three days earlier. Without lineage, that investigation would have taken days.”

Phrases: trace lineage, upstream lineage (where data came from), downstream lineage (what depends on it), column-level lineage, the lineage graph.
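
“Upstream” and “downstream” become easier to picture once you treat lineage as a directed graph. The sketch below uses hypothetical dataset names and a hand-written edge list. Real lineage tools typically extract these edges from query logs or pipeline metadata.

```python
from collections import defaultdict, deque

# Hypothetical edges: (upstream dataset, downstream dataset).
EDGES = [
    ("crm.export", "bronze.customers"),
    ("bronze.customers", "silver.customers"),
    ("silver.customers", "gold.monthly_report"),
    ("silver.orders", "gold.monthly_report"),
    ("gold.monthly_report", "finance.dashboard"),
]


def build(edges):
    """Index the edge list in both directions."""
    up, down = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        down[src].add(dst)
        up[dst].add(src)
    return up, down


def trace(graph, start):
    """Breadth-first walk returning every dataset reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


UP, DOWN = build(EDGES)
print(sorted(trace(UP, "gold.monthly_report")))   # upstream lineage
print(sorted(trace(DOWN, "bronze.customers")))    # downstream lineage
```

The first call answers “where did this report’s data come from?” and the second answers “what breaks if this table is wrong?”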

Table Format

A table format (Delta Lake, Apache Iceberg, Apache Hudi) is an open storage layer that adds ACID transactions, schema evolution, and time travel to files stored on object storage.

“We standardized on Iceberg as our table format. It gives us time travel for debugging, concurrent writes without conflicts, and hidden partitioning so query performance doesn’t depend on how the data was ingested.”

“Before we chose Iceberg, we evaluated Delta Lake and Hudi. Iceberg won on catalog interoperability: it works with both Spark and our Trino cluster without any custom integration work.”

Phrases: commit a transaction, time travel to a snapshot, hidden partitioning, table format interoperability.
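
The phrases “commit a transaction” and “time travel to a snapshot” rest on one idea: every commit creates a new immutable version of the table. The toy class below illustrates only that idea. `SnapshotTable` is hypothetical and is not the API of Delta Lake, Iceberg, or Hudi, which track versions through metadata and logs over data files and do not copy rows in memory.

```python
class SnapshotTable:
    """Toy in-memory table: every commit produces a new immutable snapshot."""

    def __init__(self):
        self._snapshots = [()]  # snapshot 0 is the empty table

    def commit(self, new_rows):
        """Append rows as one unit and return the new snapshot id."""
        current = self._snapshots[-1]
        self._snapshots.append(current + tuple(new_rows))
        return len(self._snapshots) - 1

    def read(self, snapshot_id=None):
        """Read the latest state, or time travel to an earlier snapshot."""
        if snapshot_id is None:
            snapshot_id = len(self._snapshots) - 1
        return list(self._snapshots[snapshot_id])


table = SnapshotTable()
first = table.commit([{"id": 1}])
second = table.commit([{"id": 2}, {"id": 3}])

print(table.read())       # latest state: three rows
print(table.read(first))  # state as of the first commit: one row
```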

Real IT Context: Phrases Engineers Actually Use

In architecture design reviews:

  • “Where does this data land first? Bronze? And when does it get promoted to silver?”
  • “Who owns this data product? If the SLA is breached at 2am, who gets paged?”
  • “We’re mixing medallion layers here — this transformation belongs in silver, not gold. Gold should be read-only aggregations.”

In cross-team alignment meetings:

  • “The contract says the schema is backward compatible, but you dropped a non-nullable field. That’s a breaking change.”
  • “We need lineage from the raw ingest all the way to the dashboard. Right now there’s a black hole between the ETL and the warehouse.”

In incident postmortems:

  • “We had no data quality SLA on this table, which is why the anomaly went undetected for a week. Going forward, every gold table needs completeness and freshness checks.”

Key Collocations

Collocation                  | Meaning
promote data through layers  | move data from bronze → silver → gold
land in bronze               | initial raw ingestion
break a data contract        | produce data that violates the agreed schema or SLA
sign off on a schema         | formally approve a schema change
trace column-level lineage   | follow a field’s origin through transformations
breach the SLA               | fail to meet a data quality commitment
time travel to a snapshot    | query historical state of a table via table format
treat data as a product      | apply product ownership and SLAs to datasets

Practice

Write a short architecture review comment (4 to 6 sentences) for a hypothetical pull request that introduces a new gold-layer table without a data contract or quality SLA. Use at least three terms from this post. Then write the author’s reply, in which they agree and explain how they will address the feedback. This dialogue practice mirrors real design review interactions. It forces you to use the vocabulary in both the “raising a concern” and “acknowledging feedback” registers, which are equally important in English-language engineering communication.