5 exercises — practice structured English answers for DBRE interviews: SLOs for databases, replication incidents, capacity planning, zero-downtime migration, and disaster recovery.
How to structure DBRE interview answers
Database SLOs: define SLIs (replication lag, query latency, connection availability) → set SLO windows → error budget
The interviewer asks: "How would you define and measure an SLO for a Postgres primary/replica setup?" Which answer is most complete?
Option B is strongest: it names four SLIs explicitly with the Postgres-specific measurement queries for each, provides concrete SLO values with error budget calculations, and explains the error budget burn-rate alert for the on-call use case. The SLO dashboard description shows how to make the information actionable. DBRE vocabulary:SLI (Service Level Indicator) — a specific, measurable metric. pg_stat_statements — Postgres extension tracking query execution statistics. pg_stat_replication — Postgres view showing replication status and lag. PgBouncer — connection pooler for PostgreSQL. Burn rate — the rate at which the error budget is being consumed. Write lag — time between WAL write on primary and receipt on replica. Options C and D are accurate but lack the Postgres-specific measurement tools and the error budget calculation.
2 / 5
The interviewer asks: "Walk me through how you'd diagnose and resolve a major replication lag incident." Which answer demonstrates the most structured incident approach?
Option B is strongest: it names a four-step diagnostic process with specific Postgres commands at each step, distinguishes WAL transfer lag from replay lag (and the different root causes for each), explains the hot standby query conflict mechanism (a subtle Postgres-specific behaviour), and gives both immediate (cancel queries, throttle writes) and long-term (batch size) resolutions. DBRE vocabulary:WAL (Write-Ahead Log) — Postgres's transaction log; the basis for replication. WAL receiver / WAL applier — processes on the replica handling WAL receipt and replay. Hot standby — a Postgres replica that accepts read queries. Query conflict — when a read query on the replica conflicts with WAL replay (e.g., reading a row being vacuumed on the primary). max_standby_streaming_delay — how long the replica waits before cancelling conflicting queries to proceed with WAL replay. Options C and D are accurate but lack the hot standby conflict mechanism explanation.
3 / 5
The interviewer asks: "What's your strategy for safe schema migration on a database that cannot tolerate downtime?" Which answer is most complete?
Option B is strongest: it categorises changes by their locking impact, explains the Postgres 11+ improvement for column defaults (a nuanced point), distinguishes gh-ost (MySQL-oriented) from pg-osc (Postgres-specific), provides the full three-phase rename pattern with application deployment steps, and includes the replica validation practice. Schema migration vocabulary:Online DDL — data definition language changes that do not lock the table for the full duration. CREATE INDEX CONCURRENTLY — Postgres command for non-blocking index creation. gh-ost — GitHub's online schema change tool. Shadow table — a temporary copy of the original table used during migration. Dual-write — application writes to both old and new columns during a migration phase. Three-phase deployment — expand/migrate/contract pattern for schema changes. Options C and D are accurate but lack the Postgres 11+ column default detail and the tooling distinction.
4 / 5
The interviewer asks: "How do you approach capacity planning for a rapidly growing database?" Which answer is most thorough?
Option B is strongest: it names four dimensions with specific Postgres monitoring queries for each, provides alert thresholds with the reasoning behind them (70% = 2-3 months runway), explains the shared_buffers working-set memory failure mode (a critical Postgres concept), names the connection storm failure pattern, and closes with the most experienced insight — rehearse the scaling operation before you need it under pressure. Capacity planning vocabulary:pg_total_relation_size — Postgres function returning total size of a table including indexes and TOAST. shared_buffers — Postgres memory parameter for the buffer pool. Working set — the frequently accessed data that should fit in memory. Connection storm — a sudden spike in connection attempts, often after an incident. pg_upgrade — Postgres utility for major version upgrades. Options C and D are accurate but lack the working-set memory failure mode explanation and the rehearsal practice.
5 / 5
The interviewer asks: "How do you handle a production database failure where a full 24-hour recovery is not acceptable?" Which answer shows the strongest incident thinking?
Option B is strongest: it frames the answer around the relationship between failure type and recovery path, provides RTO estimates for each approach with the factors that affect them, explains RPO for each path, gives the application reconnection mechanism options (DNS, PgBouncer), notes the important caveat that promotion requires a healthy replica, distinguishes PITR's use case (data corruption vs. hardware failure), and closes with the proactive architectural fix recommendation. DBRE incident vocabulary:RTO (Recovery Time Objective) — maximum acceptable downtime. RPO (Recovery Point Objective) — maximum acceptable data loss. PITR (Point-in-Time Recovery) — restoring a database to a specific moment using WAL archives. WAL archiving — continuously copying WAL files to durable storage for recovery purposes. Hot standby promotion — converting a read replica to a primary database. PgBouncer config reload — updating PgBouncer's target database without restarting the pooler. Options C and D are accurate but lack the failure-type-to-recovery-path mapping and the architectural fix recommendation.