5 exercises — choose the best-structured answer to common data quality engineering interview questions. Focus on validation frameworks, test architecture, data contracts, lineage, and anomaly detection.
Structure for data quality interview answers
Name the framework objects: Great Expectations has Expectations, Suites, Validators, Checkpoints — use the correct terms
Distinguish test types: schema tests vs singular tests vs custom generic tests have different scopes and reuse patterns
Cover enforcement layers: data contracts need schema, runtime, and CI enforcement — not just documentation
Address tooling specifics: dbt, OpenLineage, Soda, DataHub — name the config options and integration points
1 / 5
The interviewer asks: "Explain how Great Expectations works — how do you define expectations, validate data, and integrate validation into a data pipeline?" Which answer best covers Great Expectations architecture?
Option B covers all six layers: core object model (Expectation, Suite, Validator, Data Context), expectation categories (column, table, multi-column) with specific method names, Data Docs HTML output, Checkpoint as the production unit with three Action types (Slack, store result, rebuild docs), Airflow integration with the exception-raising failure behaviour, and auto-profiling as a bootstrap strategy. Options A, C, D each describe the concept correctly but don't cover the Checkpoint execution model, Action types, or Airflow integration mechanics.
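The Checkpoint execution model described above can be sketched in plain Python. This is deliberately not the Great Expectations API: the classes Expectation, Suite, Checkpoint and ValidationFailed, and the helpers not_null and row_count_between, are hypothetical stand-ins that only mirror the roles named in the explanation (a suite of expectations, a checkpoint that runs it, actions fired on the result, and an exception that would fail an orchestrator task such as an Airflow task).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Expectation:
    """A named assertion about a batch of rows (hypothetical stand-in)."""
    name: str
    check: Callable[[List[Row]], bool]


@dataclass
class Suite:
    """A named collection of expectations validated together."""
    name: str
    expectations: List[Expectation] = field(default_factory=list)

    def validate(self, rows: List[Row]) -> Dict[str, bool]:
        return {e.name: e.check(rows) for e in self.expectations}


class ValidationFailed(Exception):
    """Raised so that a calling pipeline task fails loudly."""


@dataclass
class Checkpoint:
    """Runs a suite, fires every action, then raises if anything failed."""
    suite: Suite
    actions: List[Callable[[Dict[str, bool]], None]] = field(default_factory=list)

    def run(self, rows: List[Row]) -> Dict[str, bool]:
        results = self.suite.validate(rows)
        for action in self.actions:  # e.g. notify, store result, rebuild docs
            action(results)
        failed = [name for name, ok in results.items() if not ok]
        if failed:
            raise ValidationFailed(f"failed expectations: {failed}")
        return results


def not_null(column: str) -> Expectation:
    return Expectation(
        f"{column}_not_null",
        lambda rows: all(r.get(column) is not None for r in rows),
    )


def row_count_between(lo: int, hi: int) -> Expectation:
    return Expectation(
        f"row_count_between_{lo}_{hi}",
        lambda rows: lo <= len(rows) <= hi,
    )


stored_results: List[Dict[str, bool]] = []  # stands in for a "store result" action
suite = Suite("orders_suite", [not_null("order_id"), row_count_between(1, 1000)])
checkpoint = Checkpoint(suite, actions=[stored_results.append])

ok_results = checkpoint.run([{"order_id": 1}, {"order_id": 2}])
try:
    checkpoint.run([{"order_id": 1}, {"order_id": None}])
    bad_raised = False
except ValidationFailed:
    bad_raised = True
```

In this sketch the actions run before the exception is raised, so the failing result is still stored for later inspection while the pipeline step stops.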
2 / 5
The interviewer asks: "Compare dbt singular tests and generic tests — how do you write each, and when would you use custom generic tests vs schema.yml assertions?" Which answer best covers dbt test architecture?
Option B covers all seven dimensions: schema test YAML syntax with a concrete column example, the singular test pattern (a SQL query that returns the failing rows, so the test passes when it returns zero rows), custom generic tests with the macro pattern and parameters, dbt-utils and dbt-expectations package capabilities with specific function names, test severity (warn vs error with YAML syntax), --store-failures for failure investigation, and test selection syntax for CI. Options A, C, D each describe the two test types correctly but don't cover custom generic test macros, severity levels, store-failures, or test selection syntax.
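The rule that a dbt test passes when its query returns zero rows can be illustrated outside dbt. The sketch below uses Python's sqlite3 module rather than dbt itself: the orders table, the run_test helper, and the not_null_sql template are hypothetical. In dbt a custom generic test is a Jinja macro taking a model and column name; the Python function here only imitates that parameterisation, and the severity argument imitates the warn vs error setting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, amount REAL, status TEXT);
    INSERT INTO orders VALUES
        (1, 20.0, 'shipped'),
        (2, -5.0, 'placed'),
        (3, 12.5, NULL);
    """
)


def run_test(sql, severity="error"):
    """Run a test query; any returned row counts as a failing record."""
    failures = conn.execute(sql).fetchall()
    if not failures:
        return "pass", failures
    return ("warn" if severity == "warn" else "fail"), failures


# Singular-style test: one hand-written query for one specific business rule.
negative_amounts = "SELECT order_id, amount FROM orders WHERE amount < 0"


# Generic-style test: a reusable template parameterised by model and column.
def not_null_sql(model, column_name):
    return f"SELECT * FROM {model} WHERE {column_name} IS NULL"


singular = run_test(negative_amounts)
generic_id = run_test(not_null_sql("orders", "order_id"))
generic_status = run_test(not_null_sql("orders", "status"), severity="warn")
```

The singular query is tied to one rule on one table, while the template can be reused on any column, which is the reuse distinction the question is probing. The returned failing rows play the role that stored failures play when investigating a broken test.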
3 / 5
The interviewer asks: "What is a data contract, how do you specify one, and how do you enforce it in a modern data stack?" Which answer best covers data contract architecture?
Option B covers all six dimensions: the four components of a contract (schema, semantics, SLA, ownership), ODCS specification format with concrete YAML fields, dbt model contracts (contract: {enforced: true}), the producer-consumer workflow with the parallel migration pattern for breaking changes, three enforcement layers (dbt materialisation, runtime GE/Soda, CI PR checks), and the contract registry pattern with tooling. Options A, C, D each identify the concept correctly but don't cover specification formats, the migration workflow, CI enforcement, or the contract registry pattern.
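Two of the enforcement layers named above, the CI check on a proposed change and the runtime check on incoming records, can be sketched as follows. The contract here is a plain dictionary invented for illustration and does not follow the ODCS format; CONTRACT_V1, breaking_changes and violations are hypothetical names. Treating an added column as non-breaking is an assumption of this sketch, not a rule stated in the source.

```python
# Hypothetical contract: column name -> declared type.
CONTRACT_V1 = {"order_id": "int", "amount": "float", "currency": "str"}

PYTHON_TYPES = {"int": int, "float": float, "str": str}


def breaking_changes(old, new):
    """CI-style check: list changes that would break existing consumers.

    Removed columns and changed types are reported; added columns are
    treated as non-breaking (an assumption of this sketch).
    """
    problems = []
    for column, dtype in old.items():
        if column not in new:
            problems.append(f"removed column: {column}")
        elif new[column] != dtype:
            problems.append(f"type change: {column} {dtype} -> {new[column]}")
    return problems


def violations(row, contract):
    """Runtime-style check: list the ways one record violates the contract."""
    found = []
    for column, dtype in contract.items():
        if column not in row:
            found.append(f"missing: {column}")
        elif not isinstance(row[column], PYTHON_TYPES[dtype]):
            found.append(f"wrong type: {column}")
    return found


proposed = {"order_id": "str", "amount": "float", "region": "str"}
ci_report = breaking_changes(CONTRACT_V1, proposed)

runtime_report = violations(
    {"order_id": 7, "amount": "9.99", "currency": "EUR"}, CONTRACT_V1
)
```

A CI job would fail the pull request whenever ci_report is non-empty, which is the point at which a producer would fall back to a parallel migration for the breaking change instead of editing the contract in place.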
4 / 5
The interviewer asks: "Explain column-level lineage — how is it different from table-level lineage, why does it matter for impact analysis, and how do modern tools capture it?" Which answer best covers data lineage depth?
Option B provides the complete picture: the precise difference between table and column lineage with concrete column-path examples, a quantified impact scale (50-200 downstream columns across 30+ models), lineage capture mechanisms (SQL parsing with specific tools: dbt manifest, Marquez, OpenLineage, DataHub, SQLGlot), the ColumnLineageDatasetFacet spec, the SELECT * lineage-breaking problem with the best-practice fix, the dbt + Airflow + DataHub integration stack, and freshness SLA propagation as an advanced use case. Options A, C, D each identify the use case correctly but provide no capture mechanisms, tooling specifics, or the SELECT * problem.
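The difference between table-level and column-level impact analysis can be shown with a small graph traversal. This is a minimal sketch over a hand-written edge map: the column names and the COLUMN_EDGES structure are hypothetical, and in practice the edges would come from a lineage tool's parsed SQL rather than being typed in.

```python
from collections import deque

# Hypothetical column-level edges: source column -> columns derived from it.
COLUMN_EDGES = {
    "raw.orders.amount": ["stg.orders.amount_usd"],
    "raw.orders.customer_id": ["stg.orders.customer_id"],
    "stg.orders.amount_usd": [
        "mart.revenue.total_usd",
        "mart.customers.lifetime_value",
    ],
    "stg.orders.customer_id": ["mart.customers.customer_id"],
}


def table_of(column):
    """Strip the column part: 'raw.orders.amount' -> 'raw.orders'."""
    return column.rsplit(".", 1)[0]


def reachable(start, edges):
    """Breadth-first search for every node downstream of start."""
    seen, queue = set(), deque([start])
    while queue:
        for child in edges.get(queue.popleft(), ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Collapse the column edges to table edges, as table-level lineage would.
TABLE_EDGES = {}
for src, targets in COLUMN_EDGES.items():
    for tgt in targets:
        TABLE_EDGES.setdefault(table_of(src), set()).add(table_of(tgt))

column_impact = reachable("raw.orders.amount", COLUMN_EDGES)
table_impact = reachable("raw.orders", TABLE_EDGES)
```

Table-level lineage marks every downstream table as affected, whereas the column-level result leaves out mart.customers.customer_id, which does not depend on amount. A SELECT * in any of the underlying queries would hide exactly these per-column edges from a SQL parser unless it also knows the upstream schema.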
5 / 5
The interviewer asks: "Compare Z-score, IQR, and ML-based approaches to detecting data quality anomalies in a pipeline — when would you use each?" Which answer best covers data anomaly detection?
Option B provides mechanical definitions of all three approaches (Z-score formula, IQR Tukey fence formula), specific limitations (the Z-score masking effect from outlier-inflated σ, IQR false positives on seasonal data), ML algorithm options beyond just Prophet (ARIMA, Isolation Forest for multivariate data, LSTM autoencoder), data requirements (a minimum of 2-4 weeks of data), a layered practical framework (Z-score/IQR for volume, ML for KPIs), a warning about alert fatigue, and specific tooling (Soda Cloud, Monte Carlo, dbt_utils.recency, custom Prophet on information_schema). Options A, C, D each name the right algorithms but don't explain the masking effect, seasonal false positives, multivariate options, or tooling.
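The two statistical detectors and the σ-inflation problem can be sketched with the standard library alone. The sample series, the function names, and the thresholds (3 standard deviations, 1.5 times the IQR) are illustrative assumptions; the quartiles use the default method of statistics.quantiles, and other quartile conventions would give slightly different fences. The ML-based approaches are not covered by this sketch.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sigma > threshold]


def iqr_outliers(values, k=1.5):
    """Flag values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical daily row counts with one obvious spike.
daily_row_counts = [100, 102, 98, 101, 99, 103, 97, 500]

z_flags = zscore_outliers(daily_row_counts)
iqr_flags = iqr_outliers(daily_row_counts)
```

On this series the Z-score check flags nothing: the spike of 500 pulls the mean to 150 and inflates σ to about 141, so its own score is only about 2.47. The Tukey fences are built from quartiles, which the spike barely moves, so the IQR check flags 500.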