Practice the vocabulary of orchestrating a multi-step data pipeline as a directed acyclic graph of tasks.
1 / 5
At standup, a dev mentions defining a data pipeline as a set of tasks with explicit dependencies between them, forming a graph with no cycles, so a scheduler can determine the correct execution order. What is this graph called?
This is a directed acyclic graph, or DAG. In Apache Airflow, a DAG defines a data pipeline as a set of tasks with explicit dependencies between them; because the graph has no cycles, the scheduler can always determine a valid execution order automatically. A single monolithic script that runs every task sequentially in a fixed order cannot express that two tasks could safely run in parallel, and it cannot recover cleanly when one specific task fails. The dependency graph is the foundational concept behind how Airflow schedules and orchestrates a pipeline's tasks.
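To make the ordering idea concrete, here is a minimal sketch that uses only Python's standard library, not Airflow itself. The task names are hypothetical, and grouping tasks into "waves" is just one illustrative way to show which tasks could run in parallel; Airflow's own scheduler is considerably more involved.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform_orders": {"extract"},
    "transform_users": {"extract"},
    "load": {"transform_orders", "transform_users"},
}


def execution_waves(deps):
    """Group tasks into waves; tasks within one wave do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(dependencies)
print(waves)
```

The two transform tasks land in the same wave because neither depends on the other, which is exactly the information a fixed sequential script cannot express.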
2 / 5
During a design review, the team wants a failed task to automatically retry a configured number of times with a delay between attempts, rather than immediately marking the entire DAG run as failed. Which capability supports this?
Per-task retry configuration supports this. In Airflow, each task can be given a retry count and a delay between attempts (the retries and retry_delay task arguments), so a failed task is retried automatically up to the configured number of times instead of the DAG run being marked failed on the first failure. Failing the whole run immediately, with no retry, treats a transient, recoverable error the same as a permanent one. Per-task retries are what let an Airflow DAG tolerate an occasional transient failure in one task without failing the entire pipeline run.
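The retry-with-delay behavior can be sketched in plain Python. This is an illustration of the concept under stated assumptions, not Airflow's implementation: run_with_retries and flaky_task are hypothetical names, and the parameter names are chosen only to mirror the idea of a retry count and a delay.

```python
import time


def run_with_retries(task, retries=3, retry_delay=0.0, sleep=time.sleep):
    """Call task(); if it raises, retry up to `retries` more times with a delay."""
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            if attempt >= retries:
                raise  # retries exhausted: treat the failure as permanent
            attempt += 1
            sleep(retry_delay)  # wait before the next attempt


calls = {"count": 0}


def flaky_task():
    """Hypothetical task that fails twice with a transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient network error")
    return "ok"


result = run_with_retries(flaky_task, retries=3, retry_delay=0.0)
print(result, calls["count"])
```

With retries set to 0, the same transient error would propagate on the first attempt, which is the "fail immediately" behavior the answer contrasts against.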
3 / 5
In a code review, a dev notices a DAG is scheduled with a defined interval and Airflow automatically backfills a run for every past interval that hasn't yet been executed since the DAG's start date. What does this represent?
This is automatic backfilling of missed schedule intervals, which Airflow calls catchup. Once the DAG is enabled with catchup turned on, the scheduler creates a run for every completed interval since the DAG's start date that has not yet been executed. A DAG that is only triggered manually has no such mechanism, so any interval that nobody explicitly triggered remains a gap. This behavior is a key part of how Airflow reconciles a DAG's schedule with what has actually been executed so far.
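The reconciliation step can be sketched as a small date calculation. This is a simplified illustration using only the standard library; missed_intervals is a hypothetical helper, the dates are made up, and a real scheduler also has to handle time zones, cron expressions, and concurrency limits.

```python
from datetime import datetime, timedelta


def missed_intervals(start_date, interval, now, completed):
    """Return the start time of every fully elapsed interval not in `completed`."""
    pending = []
    current = start_date
    # An interval is only eligible once it has completely elapsed.
    while current + interval <= now:
        if current not in completed:
            pending.append(current)
        current += interval
    return pending


start = datetime(2024, 1, 1)
now = datetime(2024, 1, 5, 12)
already_run = {datetime(2024, 1, 1)}

pending = missed_intervals(start, timedelta(days=1), now, already_run)
for interval_start in pending:
    print(interval_start.date())
```

Here the daily intervals starting on January 2, 3, and 4 are reported as pending; the January 5 interval is skipped because it has not finished yet.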
4 / 5
An incident report shows a transient network error caused a single task partway through a large DAG to fail, and because that task had no retry configured, the entire DAG run was marked failed and had to be manually restarted from the beginning. What practice would prevent this?
Configuring a sensible retry count and delay on the individual task lets a transient error be retried automatically before the DAG run is marked failed, which avoids exactly the manual restart this incident describes. Leaving tasks with no retry configuration treats a fleeting, recoverable error the same as a genuine, permanent failure. Per-task retry configuration is a standard resilience practice for any DAG that includes a task prone to occasional transient errors, such as a network call to an external service.
5 / 5
During a PR review, a teammate asks why the team defines its pipeline as an Airflow DAG instead of a single monolithic script running every step sequentially in a fixed order. What is the reasoning?
A monolithic script has no way to express that two independent steps could run in parallel, or to recover cleanly from just one step's failure without restarting everything from scratch. A DAG models explicit dependencies between tasks and lets Airflow schedule, retry, and backfill individual tasks based on that graph. The tradeoff is the added complexity of learning and maintaining Airflow's own scheduling and DAG-authoring concepts compared to writing a single straightforward script.
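The "recover from one step's failure without restarting everything" point can be illustrated with a small sketch that tracks per-task state. The names here (run_pipeline, the three task functions, the state dictionary) are hypothetical, and this is a toy model of the idea rather than how Airflow stores task state.

```python
def run_pipeline(ordered_tasks, state):
    """Run tasks in order, skipping any already marked 'success' in `state`."""
    executed = []
    for name, func in ordered_tasks:
        if state.get(name) == "success":
            continue  # finished in an earlier attempt; do not redo it
        try:
            func()
        except Exception:
            state[name] = "failed"
            break  # downstream tasks must wait for this one
        state[name] = "success"
        executed.append(name)
    return executed


network_up = {"value": False}


def extract():
    pass


def transform():
    # Simulates a step that depends on a flaky external service.
    if not network_up["value"]:
        raise ConnectionError("transient network error")


def load():
    pass


tasks = [("extract", extract), ("transform", transform), ("load", load)]
state = {}

first_run = run_pipeline(tasks, state)   # stops at the failing transform step
network_up["value"] = True
second_run = run_pipeline(tasks, state)  # resumes without repeating extract
print(first_run, second_run)
```

A plain sequential script would rerun the extract step on the second attempt; tracking state per task is what makes the resume possible, at the cost of the extra bookkeeping the answer mentions.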