Five exercises: choose the best-structured answer to common Infrastructure Engineer interview questions. Focus on precise vocabulary, correct use of technical terms, and evidence of real experience.
Structure for infrastructure engineering interview answers
Name the IaC pattern: remote backend, workspace isolation, policy as code — explain which problem each solves
Explain state backend choice: name the specific backend and its locking mechanism, not just "remote state"
Address blast radius: always explain how you scope the impact of a failed apply or a misconfigured module
Cite automation over manual intervention: IAM restrictions, CI enforcement, scheduled drift detection — not "discipline"
1 / 5
The interviewer asks: "How do you manage Terraform state in a team environment, and what are the risks?" Which answer best demonstrates Terraform state management expertise?
Option B is strongest. It explains why remote state is mandatory (not just "better"), names all three backend options with their specific locking mechanisms, and explains the sensitive-data risk with concrete examples (DB passwords, API keys) and the required controls. It also explains why workspaces are risky for production (shared backend config, wrong-environment risk), names the stale-state problem that locking alone does not solve, and introduces CI as the enforcement mechanism and drift as an incident category.

Key structure: remote backend mandatory → S3/DynamoDB + GCS + Terraform Cloud options → sensitive data in plaintext + controls → separate state files vs workspaces for prod → locking + CI serialisation → drift detection as incident.

Option C is accurate but does not explain the workspace risk or the stale-state problem. Option D is surface-level: it does not explain the workspace isolation trade-off or the CI enforcement pattern.
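To make the "sensitive data in plaintext" point concrete, here is a minimal Python sketch that scans a parsed state file for secret-looking attributes. It assumes the version 4 state layout (`resources` → `instances` → `attributes`); the regex, the function name `find_plaintext_secrets`, and the example data are illustrative assumptions, not part of any Terraform tooling.

```python
from __future__ import annotations

import re
from typing import Any, Iterator

# Assumed heuristic: attribute names that commonly hold secrets.
SENSITIVE_KEY = re.compile(r"password|secret|token|api_key|private_key", re.IGNORECASE)


def find_plaintext_secrets(state: dict[str, Any]) -> Iterator[tuple[str, str]]:
    """Yield (resource address, attribute name) for non-empty secret-looking attributes."""
    for resource in state.get("resources", []):
        address = f'{resource.get("type")}.{resource.get("name")}'
        for instance in resource.get("instances", []):
            for key, value in (instance.get("attributes") or {}).items():
                if value and SENSITIVE_KEY.search(key):
                    yield address, key


# Hypothetical state content, shaped like a parsed state file.
example_state = {
    "version": 4,
    "resources": [
        {
            "type": "aws_db_instance",
            "name": "orders",
            "instances": [{"attributes": {"username": "app", "password": "hunter2"}}],
        },
        {
            "type": "aws_s3_bucket",
            "name": "logs",
            "instances": [{"attributes": {"bucket": "example-logs"}}],
        },
    ],
}

findings = list(find_plaintext_secrets(example_state))
print(findings)
```

A scan like this only shows why the state file needs encryption at rest and tight access control; it is not a substitute for those controls.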
2 / 5
The interviewer asks: "How do you detect and respond to infrastructure drift in an IaC environment?" Which answer best demonstrates drift management expertise?
Option B is strongest. It defines drift with all three causes (not just console changes) and names three specific detection tools with their different scopes: terraform plan for managed resources, AWS Config for attribute-level compliance, and driftctl for unmanaged resources. It frames drift response as an incident with owner assignment and two decision paths (codify or revert), places the reconciliation principle at the IAM enforcement level (not just "discipline"), and warns about the dangerous edge case: review the plan before applying a drift fix to avoid unintended destruction.

Key structure: drift causes (console, API, other tools) → three detection tools at different scopes → incident response with two decision paths → IAM enforcement over discipline → plan review before fix to avoid resource destruction → cultural codification rule.

Option C is accurate and covers IAM restriction but does not explain the resource destruction risk or name driftctl. Option D mentions driftctl but does not explain the IAM enforcement principle or the plan-review risk.
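The detection and plan-review steps can be sketched in Python. The sketch relies on two documented behaviours: `terraform plan -detailed-exitcode` exits with 0 (no changes), 1 (error), or 2 (changes present), and the JSON plan output lists each resource under `resource_changes` with a `change.actions` array. The function names and the example plan are hypothetical, and treating exit code 2 as drift assumes the plan runs on a schedule against already-applied code.

```python
from __future__ import annotations

from typing import Any


def classify_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in-sync"  # plan succeeded and found no changes
    if code == 2:
        return "drift"    # plan succeeded and found changes
    return "error"        # 1 (or anything unexpected): the plan itself failed


def destructive_changes(plan: dict[str, Any]) -> list[str]:
    """Return addresses of resources the plan would delete or replace."""
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if "delete" in rc.get("change", {}).get("actions", [])
    ]


# Hypothetical excerpt of a JSON plan produced while fixing drift.
example_plan = {
    "resource_changes": [
        {"address": "aws_security_group.web", "change": {"actions": ["update"]}},
        {"address": "aws_db_instance.orders", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
    ]
}

print(classify_plan_exit_code(2))
print(destructive_changes(example_plan))
```

A replacement appears as `["delete", "create"]`, which is why checking for `"delete"` catches both outright deletions and forced replacements before a drift fix is applied.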
3 / 5
The interviewer asks: "How do you test infrastructure-as-code before deploying to production?" Which answer best covers a multi-layer IaC testing strategy?
Option B is strongest. It names four distinct layers with specific tools for each, explains why each layer exists (what it catches that the previous layer misses), and gives the time/cost profile (static: seconds; unit: minutes). It explains OPA/Sentinel at the plan JSON level (not just "policy checking"), introduces contract testing as a tool for module interface stability (often missed), and states the blast radius principle with the ephemeral environment rule.

Key structure: four layers (static → unit → policy as code → contract) → specific tools per layer → what each catches that others miss → plan JSON as OPA input → contract testing for module interfaces → ephemeral test environment.

Option C is accurate and covers all four concepts but does not explain what each layer catches that others miss, or introduce contract testing. Option D is similar: accurate, but it does not explain layer differentiation or contract testing.
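To illustrate what "policy at the plan JSON level" means, the following Python sketch evaluates a simple rule against the `resource_changes` section of a JSON plan. In practice this rule would be written in Rego (OPA) or Sentinel; Python is used here only as a stand-in. The required tag set, the function name, and the example plan are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Any

# Hypothetical policy: every newly created resource must carry these tags.
REQUIRED_TAGS = {"owner", "environment"}


def missing_tag_violations(plan: dict[str, Any]) -> dict[str, list[str]]:
    """Map each to-be-created resource address to the required tags it lacks."""
    violations: dict[str, list[str]] = {}
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        if "create" not in change.get("actions", []):
            continue  # the policy only inspects resources the plan will create
        tags = (change.get("after") or {}).get("tags") or {}
        missing = REQUIRED_TAGS - set(tags)
        if missing:
            violations[rc["address"]] = sorted(missing)
    return violations


# Hypothetical excerpt of a JSON plan.
example_plan = {
    "resource_changes": [
        {
            "address": "aws_instance.api",
            "change": {
                "actions": ["create"],
                "after": {"tags": {"owner": "platform", "environment": "staging"}},
            },
        },
        {
            "address": "aws_instance.worker",
            "change": {"actions": ["create"], "after": {"tags": {"owner": "platform"}}},
        },
    ]
}

violations = missing_tag_violations(example_plan)
print(violations)
```

Because the check runs on the plan rather than on the source files, it sees resolved values after variable and module expansion, which is what static linting cannot catch.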
4 / 5
The interviewer asks: "How do you design networking for a multi-cloud or hybrid-cloud environment?" Which answer best demonstrates multi-cloud networking expertise?
Option B is strongest. It frames the problem across three dimensions (connectivity, security, routing), explains the transitive routing limitation of VPC peering with the specific solution (Transit Gateway/Virtual WAN), and contrasts IPsec VPN with dedicated interconnect using decision criteria (latency sensitivity). It introduces PrivateLink/Private Service Connect as the service-level alternative to full network peering, notes the stateful vs stateless security group difference across clouds (a real operational gotcha), explains BGP route filtering for hybrid, and gives the latency measurement principle.

Key structure: VPC peering limitation → Transit Gateway/Virtual WAN → IPsec VPN vs dedicated interconnect decision criteria → PrivateLink for service-level isolation → stateful vs stateless security semantics → BGP route filtering → measure actual RTT.

Option C is accurate and covers most points but does not explain the stateful/stateless security group difference or Private Service Connect on GCP. Option D does not introduce PrivateLink or the security semantics difference.
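The transitive routing limitation is easier to explain with a small model. This Python sketch treats peering as a set of direct pairs and a hub (Transit Gateway or Virtual WAN style) as a set of attachments. It is a deliberate simplification: real hub reachability also depends on route tables and propagation settings, and all names here are hypothetical.

```python
from __future__ import annotations


def peering_reachable(peerings: set[frozenset[str]], src: str, dst: str) -> bool:
    """Peering is non-transitive: traffic flows only across a direct peering."""
    return frozenset((src, dst)) in peerings


def hub_reachable(attachments: set[str], src: str, dst: str) -> bool:
    """Simplified hub model: any two attached networks can route via the hub."""
    return src in attachments and dst in attachments


def full_mesh_peerings(n: int) -> int:
    """Number of peerings needed to connect n networks pairwise."""
    return n * (n - 1) // 2


# vpc-a <-> vpc-b and vpc-b <-> vpc-c are peered; vpc-a and vpc-c are not.
peerings = {frozenset(("vpc-a", "vpc-b")), frozenset(("vpc-b", "vpc-c"))}
attachments = {"vpc-a", "vpc-b", "vpc-c"}

print(peering_reachable(peerings, "vpc-a", "vpc-b"))  # direct peering exists
print(peering_reachable(peerings, "vpc-a", "vpc-c"))  # no path through vpc-b
print(hub_reachable(attachments, "vpc-a", "vpc-c"))
print(full_mesh_peerings(10))
```

The last function shows the scaling argument: ten networks need 45 peerings for a full mesh but only ten hub attachments.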
5 / 5
The interviewer asks: "How do you approach cloud cost optimisation at scale without sacrificing reliability?" Which answer best demonstrates FinOps thinking?
Option B is strongest. It names FinOps as the cultural framework (not just a set of tools), gives a specific percentile-based rightsizing methodology (14-day CPU/memory percentile, not just "look at utilisation"), and quantifies the purchasing options (Savings Plans: 30-40% discount; Spot: up to 90% with the 2-minute notice constraint). It explains the tagging taxonomy with specific tag keys and the distinction between showback and chargeback, and, critically, explains the Spot reliability trade-off with the specific failure modes to avoid (stateful databases, single-node prod).

Key structure: FinOps culture first → rightsizing with percentile methodology → purchasing ladder (On-Demand/Savings Plans/Reserved/Spot) with discounts → idle resource automation → tagging taxonomy for showback vs chargeback → Spot reliability trade-off (graceful shutdown, no stateful databases) → anomaly detection.

Option C is accurate and covers showback/chargeback but does not quantify the discounts or explain the Spot reliability constraints. Option D is accurate but does not give the percentile rightsizing approach or the showback vs chargeback distinction.
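A short Python sketch shows why a percentile is a better rightsizing signal than an average. It uses the nearest-rank percentile definition; the 40% downsize threshold, the function names, and the sample data are assumptions for illustration, and the samples are taken to represent CPU utilisation (%) over the 14-day window.

```python
from __future__ import annotations


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def rightsizing_hint(cpu_samples: list[float], downsize_below: float = 40.0) -> str:
    """Suggest downsizing only when the 95th percentile stays under the assumed threshold."""
    p95 = percentile(cpu_samples, 95)
    return "downsize candidate" if p95 < downsize_below else "keep current size"


# Mostly idle, but with regular bursts: the average hides the peaks.
bursty = [10.0] * 17 + [85.0] * 3
# Consistently low utilisation.
steady = [20.0] * 20

print(sum(bursty) / len(bursty), percentile(bursty, 95), rightsizing_hint(bursty))
print(sum(steady) / len(steady), percentile(steady, 95), rightsizing_hint(steady))
```

The bursty workload averages about 21% CPU, which looks oversized, yet its 95th percentile is 85%, so downsizing it would trade cost for reliability.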