5 exercises — practise answering Database Reliability Engineer interview questions in professional technical English.
1 / 5
The interviewer asks: "How do you define and measure SLOs for a PostgreSQL cluster that serves a transactional application?" Which answer best demonstrates Database Reliability Engineer expertise?
Option B is strongest because it defines multi-dimensional SLOs with specific numeric targets, explains the measurement mechanism for each (synthetic heartbeat, pg_stat_statements, replication LSN), calculates the error budget, and describes proactive alerting on the error-budget burn rate. Option A defines only availability, with no latency or replication dimensions. Option C is reactive and has no quantified targets. Option D incorrectly separates database reliability from SLO ownership.
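The error budget and burn rate mentioned above can be sketched in a few lines. This is a minimal illustration, not the calculation from Option B itself: the 99.95% target, the 30-day window, and the function names are assumptions chosen for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed unavailability, in minutes, for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """Budget consumption speed; 1.0 means the budget lasts exactly one window."""
    return observed_error_ratio / (1.0 - slo_target)


# Assumed target: 99.95% availability over 30 days (about 21.6 minutes of budget).
budget = error_budget_minutes(0.9995, window_days=30)

# Assumed observation: 0.5% of heartbeat probes failing, roughly a 10x burn rate.
rate = burn_rate(observed_error_ratio=0.005, slo_target=0.9995)

print(f"budget: {budget:.1f} min, burn rate: {rate:.1f}x")
```

Alerting on burn rate rather than on raw failures means a page fires when the budget would run out well before the window ends, not on every isolated error.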
2 / 5
The interviewer asks: "Walk me through a failover procedure when the primary PostgreSQL instance becomes unresponsive." Which answer best demonstrates Database Reliability Engineer expertise?
Option B is strongest because it describes an automated four-phase procedure with specific tooling (Patroni, etcd, STONITH, PgBouncer), explains why each phase is necessary (especially fencing for split-brain prevention), and gives concrete timing targets. Option A is unsafe — restarting a primary during a failure scenario can cause data loss if it has already been superseded. Option C is manual and too slow for a production SLO. Option D defers reliability to the provider without understanding the failover mechanism.
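The point about fencing can be made concrete with a small ordering guard. This sketch assumes the four phases are detection, fencing, promotion, and rerouting; the explanation above does not name them, and the classes below are hypothetical and do not call Patroni, etcd, or PgBouncer.

```python
from enum import Enum


class Phase(Enum):
    # Assumed phase names; the order is what matters.
    DETECT = 1
    FENCE = 2
    PROMOTE = 3
    REROUTE = 4


class FailoverRun:
    """Tracks a failover and refuses to run phases out of order."""

    def __init__(self) -> None:
        self.completed: list[Phase] = []

    def complete(self, phase: Phase) -> None:
        if len(self.completed) == len(Phase):
            raise RuntimeError("failover already finished")
        expected = Phase(len(self.completed) + 1)
        if phase is not expected:
            # Promoting before fencing risks two writable primaries (split-brain).
            raise RuntimeError(f"expected {expected.name}, got {phase.name}")
        self.completed.append(phase)


run = FailoverRun()
run.complete(Phase.DETECT)
run.complete(Phase.FENCE)
run.complete(Phase.PROMOTE)
run.complete(Phase.REROUTE)
```

Attempting `complete(Phase.PROMOTE)` directly after `DETECT` raises, which mirrors the rule that a replica is promoted only once the old primary can no longer accept writes.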
3 / 5
The interviewer asks: "How do you detect and resolve connection pool exhaustion in a production PostgreSQL environment?" Which answer best demonstrates Database Reliability Engineer expertise?
Option B is strongest because it distinguishes the two root causes, provides specific SQL diagnostic queries, explains immediate and long-term remediation for each cause, and introduces a pool mode change (from session to transaction pooling) that it credits with a 10x capacity improvement. Option A increases max_connections without understanding the cause; this can destabilise PostgreSQL memory usage and worsen performance. Option C adds read replicas, which does not help with write connection saturation. Option D relies on restarts, a blunt instrument that causes downtime and does not address the underlying cause.
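Where a figure like 10x can come from is easier to see with back-of-the-envelope arithmetic. The sketch below assumes that in session pooling a client holds a server connection for its whole session, while in transaction pooling it holds one only while a transaction is open; the 10% busy share and the function name are illustrative assumptions, not measurements.

```python
def supported_clients(server_connections: int, busy_percent: int, mode: str) -> int:
    """Rough upper bound on concurrent clients a pool can serve.

    busy_percent is the assumed share of session time a client spends
    inside a transaction (1-100).
    """
    if not 1 <= busy_percent <= 100:
        raise ValueError("busy_percent must be between 1 and 100")
    if mode == "session":
        # One server connection per connected client, busy or idle.
        return server_connections
    if mode == "transaction":
        # Connections are shared between transactions, so idle time is reclaimed.
        return server_connections * 100 // busy_percent
    raise ValueError(f"unknown pool mode: {mode}")


session_cap = supported_clients(100, busy_percent=10, mode="session")
transaction_cap = supported_clients(100, busy_percent=10, mode="transaction")
print(session_cap, transaction_cap)
```

The real gain depends on workload: long transactions or session-level features such as session-scoped prepared statements reduce or rule out the benefit.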
4 / 5
The interviewer asks: "How do you perform Point-In-Time Recovery (PITR) and what are the key RTO/RPO trade-offs?" Which answer best demonstrates Database Reliability Engineer expertise?
Option B is strongest because it explains the WAL archiving mechanism, names specific tooling (pgBackRest, WAL-G), gives concrete RPO and RTO figures with their determinants, describes the recovery_target_time configuration, and emphasises regular restore testing. Option A describes log shipping from application logs, not PostgreSQL WAL; this cannot guarantee consistency. Option C relies on snapshots, which provide coarser-grained recovery than WAL replay, with an RPO equal to the snapshot interval. Option D describes HA failover, not PITR: replication cannot recover from logical data corruption or accidental deletion.
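The determinants of RPO and RTO can be written down as simple estimates. This is a rough sketch under stated assumptions: the archiver keeps up so that worst-case RPO is bounded by archive_timeout, RTO is dominated by base backup restore plus WAL replay, and all throughput figures are hypothetical.

```python
def worst_case_rpo_seconds(archive_timeout_s: int) -> int:
    """With WAL archiving, at most one unarchived segment interval is lost
    (assuming the archiver is healthy and keeping up)."""
    return archive_timeout_s


def estimated_rto_minutes(
    backup_size_gb: float,
    restore_gb_per_min: float,
    wal_to_replay_gb: float,
    replay_gb_per_min: float,
) -> float:
    """Restore the base backup, then replay WAL up to the recovery target."""
    restore = backup_size_gb / restore_gb_per_min
    replay = wal_to_replay_gb / replay_gb_per_min
    return restore + replay


# Hypothetical cluster: 500 GB backup restored at 10 GB/min, 20 GB of WAL at 2 GB/min.
rpo = worst_case_rpo_seconds(archive_timeout_s=60)
rto = estimated_rto_minutes(500, 10, 20, 2)
print(f"RPO <= {rpo} s, RTO ~ {rto:.0f} min")
```

More frequent base backups shorten the WAL replay term at the cost of backup load and storage, which is the core trade-off; only a timed restore test confirms the real numbers.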
5 / 5
The interviewer asks: "How do you investigate and remediate a slow query that is causing latency spikes during peak traffic?" Which answer best demonstrates Database Reliability Engineer expertise?
Option B is strongest because it follows a structured five-step methodology: identifying high-impact queries by total time, capturing execution plans with EXPLAIN (ANALYZE, BUFFERS), diagnosing specific root causes (stale statistics, bad plans), testing the fix on a clone, and monitoring for regressions after the fix. Option A adds indexes blindly, which can worsen write performance and may not address the actual problem. Option C runs EXPLAIN on a single instance without checking statistics, which is incomplete. Option D caches results, which is a valid performance pattern but does not fix the query itself and introduces cache invalidation complexity.
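The first step, ranking by total time rather than by per-call latency, is worth illustrating because the two orderings often disagree. The sketch below works on hand-written sample rows; the field names loosely follow pg_stat_statements (calls, mean execution time), and the queries and numbers are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class QueryStat:
    query: str
    calls: int
    mean_exec_time_ms: float

    @property
    def total_exec_time_ms(self) -> float:
        # Total load on the database: how often times how long.
        return self.calls * self.mean_exec_time_ms


def top_by_total_time(stats: list[QueryStat], limit: int = 3) -> list[QueryStat]:
    """Return the queries that consume the most cumulative execution time."""
    return sorted(stats, key=lambda s: s.total_exec_time_ms, reverse=True)[:limit]


# Hypothetical sample data.
sample = [
    QueryStat("SELECT ... FROM monthly_report", calls=10, mean_exec_time_ms=2000.0),
    QueryStat("SELECT ... FROM orders WHERE id = $1", calls=500_000, mean_exec_time_ms=0.5),
    QueryStat("UPDATE carts SET ...", calls=1_000, mean_exec_time_ms=50.0),
]

for stat in top_by_total_time(sample):
    print(f"{stat.total_exec_time_ms:>10.0f} ms  {stat.query}")
```

Here the half-millisecond lookup ranks first because of its call volume, while the two-second report ranks last, so a small per-call improvement to the lookup would free the most capacity during peak traffic.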