Build fluency in the vocabulary of scheduling and coordinating interdependent data tasks.
1 / 5
At standup, a data engineer mentions defining a set of interdependent data-processing tasks as a graph, so a scheduler runs each one in the correct order once its dependencies complete. What is this tool category called?
A data pipeline orchestrator, such as Airflow or Dagster, lets an engineer define interdependent data-processing tasks as a graph (typically a directed acyclic graph, or DAG) and runs each task only after its upstream dependencies have completed successfully. A single sequential script, by contrast, has no built-in knowledge of which tasks depend on which, or which could safely run in parallel. Dependency-aware scheduling is what makes a complex, multi-step pipeline reliable to run.
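As a rough illustration of dependency-aware ordering, the sketch below uses Python's standard-library `graphlib.TopologicalSorter` (Python 3.9+) rather than a real orchestrator. The task names and the `run_in_dependency_order` helper are hypothetical, and a production scheduler would also handle parallelism, failures, and persistence.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of upstream tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_tables": {"extract_orders", "extract_customers"},
    "publish_report": {"join_tables"},
}


def run_in_dependency_order(deps, run_task):
    """Run each task only after all of its upstream tasks have completed."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    completed = []
    while sorter.is_active():
        # get_ready() returns the tasks whose dependencies are all marked done.
        # Tasks in the same batch are independent and could run in parallel.
        for task in sorted(sorter.get_ready()):
            run_task(task)
            completed.append(task)
            sorter.done(task)  # unblocks any downstream tasks
    return completed


order = run_in_dependency_order(dependencies, lambda task: None)
print(order)
```

Here the two extract tasks become ready together, `join_tables` waits for both, and `publish_report` runs last.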
2 / 5
During a design review, the team wants a failed task to automatically retry a limited number of times before the whole pipeline run is marked as failed. Which capability supports this?
Automated task retry with a configured limit re-runs a task that hits a transient failure, such as a brief network timeout, a set number of times before the pipeline run is marked as failed. Failing the whole run on the first error treats every failure as unrecoverable, including temporary ones. Retries improve the pipeline's reliability without requiring manual intervention for each minor, self-resolving hiccup.
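A minimal sketch of a bounded retry loop follows, assuming transient failures are signaled by a hypothetical `TransientError` exception. The `run_with_retries` helper and its parameters are illustrative and do not reflect any particular orchestrator's API.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, e.g. a network timeout."""


def run_with_retries(task, max_retries=3, delay_seconds=0.0):
    """Call task(); on TransientError, retry up to max_retries more times.

    Returns (result, attempts). Re-raises once the retry limit is exhausted,
    which is the point where a run would be marked as failed.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except TransientError:
            if attempts > max_retries:
                raise
            time.sleep(delay_seconds)  # optional pause between attempts


# Example: a task that fails twice, then succeeds on the third attempt.
calls = {"count": 0}


def flaky_task():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary timeout")
    return "ok"


result, attempts = run_with_retries(flaky_task, max_retries=3)
print(result, attempts)
```

Only the designated transient error is retried in this sketch; any other exception propagates immediately, which keeps genuine bugs from being masked by retries.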
3 / 5
In a code review, a dev notices the orchestrator tracks each task's execution history and lets a specific past run be inspected or re-triggered independently. What does this represent?
This is run history with re-triggering: the orchestrator records each individual execution of the pipeline, so an engineer can inspect a specific past run's logs and outcome, or re-trigger just that run on its own. Discarding history as soon as a run completes removes the ability to debug a past failure or to understand how the pipeline behaves over time. This record is essential for diagnosing an intermittent issue that appears only in certain runs.
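The idea can be sketched with a small in-memory store. The `RunRecord` and `RunHistory` classes and the `load_partition` task are hypothetical names for illustration; a real orchestrator would persist this state in a database.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    run_id: int
    params: dict
    status: str = "running"
    logs: list = field(default_factory=list)


class RunHistory:
    """Keeps one record per execution so past runs can be inspected or re-run."""

    def __init__(self, pipeline):
        self.pipeline = pipeline
        self.runs = {}

    def trigger(self, params):
        record = RunRecord(run_id=len(self.runs) + 1, params=dict(params))
        self.runs[record.run_id] = record
        try:
            self.pipeline(record.params, record.logs.append)
            record.status = "success"
        except Exception as exc:
            record.logs.append(f"error: {exc}")
            record.status = "failed"
        return record

    def retrigger(self, run_id):
        # Re-run with the same parameters as the original; this adds a new record.
        return self.trigger(self.runs[run_id].params)


available_dates = set()


def load_partition(params, log):
    log(f"loading {params['date']}")
    if params["date"] not in available_dates:
        raise RuntimeError("source partition missing")
    log("done")


history = RunHistory(load_partition)
first = history.trigger({"date": "2024-01-01"})  # fails: partition not there yet
available_dates.add("2024-01-01")
second = history.retrigger(first.run_id)  # same parameters, new run record
print(first.status, first.logs)
print(second.status, second.logs)
```

The failed run's logs stay available after the successful re-run, which is what makes later debugging possible.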
4 / 5
An incident report shows a downstream task started processing data before an upstream task had actually finished writing it, producing incomplete, corrupted output. What practice would prevent this?
Configure the orchestrator so that a downstream task starts only after its upstream dependency is confirmed complete; the downstream task then processes finished, correct output rather than a partial, in-progress write. Relying on a fixed schedule to line the tasks up ignores that an upstream task's duration varies from run to run and cannot be predicted exactly. Dependency-based triggering, as opposed to purely time-based scheduling, is what an orchestrator is designed to provide.
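The check behind dependency-based triggering can be sketched as a gate on upstream task state instead of on the clock. The state names (`pending`, `running`, `success`) and the `ready_tasks` helper are assumptions made for this example.

```python
def ready_tasks(upstream, states):
    """Return pending tasks whose upstream tasks have all succeeded."""
    return sorted(
        task
        for task, deps in upstream.items()
        if states[task] == "pending"
        and all(states[dep] == "success" for dep in deps)
    )


# Hypothetical two-step pipeline: aggregate reads what write_raw produces.
upstream = {"write_raw": [], "aggregate": ["write_raw"]}
states = {"write_raw": "running", "aggregate": "pending"}

before = ready_tasks(upstream, states)  # write_raw is still writing
states["write_raw"] = "success"
after = ready_tasks(upstream, states)  # the upstream write is confirmed complete

print(before, after)
```

However long `write_raw` takes, `aggregate` is not eligible to start until the upstream state changes to `success`.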
5 / 5
During a PR review, a teammate asks why the team uses a dedicated orchestrator instead of chaining these data tasks together with a set of cron jobs. What is the reasoning?
Independently scheduled cron jobs have no built-in way to know whether an upstream job finished successfully before a downstream one starts, and they offer no coordinated retries or run-history tracking across the whole pipeline. A dedicated orchestrator provides dependency handling, retry behavior, and history tracking as first-class capabilities. The tradeoff is the added operational overhead of running and maintaining the orchestrator platform itself.