Event-Driven Systems Architect Interview Questions
5 exercises — choose the best-structured answer to common Event-Driven Architect interview questions. Focus on architectural precision, schema evolution strategy, saga design, and idempotency vocabulary.
Structure for event-driven architecture questions
Name the pattern: choreography vs. orchestration, saga, CQRS, event sourcing
Justify trade-offs: why this pattern over the alternative in this context
1 / 5
The interviewer asks: "Walk me through how you would design an event-driven order fulfilment system. How do you choose between choreography and orchestration for the saga?" Which answer best demonstrates architectural depth?
Option B is strongest: it names the pattern (saga), specifies all four services and their compensating transactions, explains why orchestration was chosen over choreography (explicit state machine, testability, traceability), and addresses the operational concern (durability/recovery). Option C is the "choreography is better" argument — but doesn't acknowledge when orchestration is the right call, and a senior architect should be able to justify the choice contextually, not dogmatically. Option D is technically a reasonable heuristic but doesn't demonstrate architectural vocabulary. Option A is correct as a high-level description but mixes choreography and orchestration without naming either. Key structure: name the pattern → justify the choreography vs. orchestration decision → specify compensating transactions → address recovery.
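The orchestration approach praised in Option B can be sketched as an explicit state machine: the orchestrator invokes each service in order and, on failure, runs the compensating transactions of the completed steps in reverse. This is a minimal illustrative sketch in Python; the step and compensation names (reserve_inventory, refund_payment, etc.) are assumptions for an order-fulfilment flow, not taken from the question.

```python
# Minimal orchestration-style saga sketch: every step has a compensating
# action, and the orchestrator unwinds completed steps in reverse on failure.
# Step and compensation names are illustrative assumptions.

def run_saga(steps, compensations, fail_at=None):
    """Execute steps in order; on failure, compensate in reverse. Returns a log."""
    log, done = [], []
    for name in steps:
        if name == fail_at:                      # simulate this step failing
            log.append(f"FAILED {name}")
            for prev in reversed(done):          # explicit, traceable unwind
                log.append(f"COMPENSATE {compensations[prev]}")
            return log
        log.append(f"OK {name}")
        done.append(name)
    return log

STEPS = ["reserve_inventory", "charge_payment", "schedule_shipment", "notify_customer"]
COMPENSATIONS = {
    "reserve_inventory": "release_inventory",
    "charge_payment": "refund_payment",
    "schedule_shipment": "cancel_shipment",
    "notify_customer": "send_correction",
}

# Happy path completes all four steps; a failure at shipment refunds the
# payment and releases the inventory, in that order.
print(run_saga(STEPS, COMPENSATIONS))
print(run_saga(STEPS, COMPENSATIONS, fail_at="schedule_shipment"))
```

Because the state machine lives in one place, it is exactly the testability and traceability argument the answer makes: the unwind order is deterministic and can be asserted in a unit test, which is much harder with choreographed events scattered across services.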
2 / 5
The interviewer asks: "How do you handle schema evolution in an event-driven system? What happens when you need to add a required field to an existing event?" Choose the most complete answer.
Option B is the strongest: it states the fundamental rule (new fields must be optional), explains the schema registry's role (enforcing compatibility), describes the full migration path for a breaking change (dual publishing + deprecation with sunset date), and adds real-world credibility ("hardest part of EDA at scale"). Option D is partially correct — versioning the event type is indeed the right approach for breaking changes — but skips the registry, the dual-publish transition period, and the consumer migration strategy. Option C attributes too much magic to Avro — it handles some cases but doesn't eliminate the need for a deliberate evolution strategy. Option A describes a big-bang lockstep deployment: the worst way to evolve schemas in a distributed system. Key tip: new fields → optional + default; breaking changes → new event type + dual publish + sunset date; always use a schema registry.
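The "new fields must be optional with a default" rule is what a schema registry's backward-compatibility check enforces mechanically. Below is a deliberately simplified sketch of that check in Python, assuming a flat field map; real registries (e.g. for Avro) apply richer rules, so treat the field shape and function name as illustrative.

```python
# Simplified sketch of a registry-style backward-compatibility check:
# a new writer schema may add fields only if they carry a default, so
# events written under the old schema can still be read. The flat
# {name: {"type": ..., "default": ...}} shape is an assumption.

def is_backward_compatible(old_fields, new_fields):
    """True if new_fields only adds defaulted fields relative to old_fields."""
    added = set(new_fields) - set(old_fields)
    removed = set(old_fields) - set(new_fields)
    if removed:                                  # dropping a field breaks readers
        return False
    return all("default" in new_fields[f] for f in added)

v1     = {"order_id": {"type": "string"}, "amount": {"type": "long"}}
v2_ok  = dict(v1, currency={"type": "string", "default": "USD"})  # optional + default
v2_bad = dict(v1, currency={"type": "string"})                    # required: breaking
```

A CI gate that calls a check like this against the registered schema is what turns the "new fields → optional + default" rule from a convention into an enforced invariant; the `v2_bad` case is exactly the change that forces the new-event-type + dual-publish path instead.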
3 / 5
The interviewer asks: "What is the dead letter queue, and how do you design a DLQ strategy for a payment event consumer?" Which answer best demonstrates operational experience?
Option B is strongest: it classifies error types (transient vs. permanent) and routes them differently, specifies what metadata is attached to DLQ messages, defines alerting strategy, describes a supervised replay process, and adds an age-based audit policy. It treats the DLQ as an operational tool that needs governance, not just a safety net. Option C is dangerously simplistic — hourly auto-replay of payment DLQ messages without investigation is a recipe for duplicate charges. Option D is too conservative for all errors — transient errors can and should be retried. Option A is accurate but shallow. Key structure: classify error types → route appropriately → enrich DLQ messages → alert on depth → supervised replay for payments.
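The classify-and-route step from Option B can be sketched as a small routing function: transient errors get retried with backoff up to a limit, everything else goes straight to the DLQ enriched with the metadata a supervised replay needs. The error taxonomy and envelope fields below are illustrative assumptions, not a prescribed format.

```python
# Sketch of DLQ routing for a payment consumer: transient failures are
# retried with exponential backoff; permanent failures (or exhausted
# retries) are routed to the DLQ with metadata for supervised replay.
# The TRANSIENT tuple and envelope field names are assumptions.
import time

TRANSIENT = (TimeoutError, ConnectionError)

def handle_failure(event, error, attempt, max_retries=3):
    """Return a routing decision for a failed payment event."""
    if isinstance(error, TRANSIENT) and attempt < max_retries:
        return {"route": "retry", "delay_s": 2 ** attempt}   # exponential backoff
    return {
        "route": "dlq",
        "payload": event,
        "error_class": type(error).__name__,    # enrich for triage and replay
        "error_message": str(error),
        "attempts": attempt,
        "failed_at": time.time(),
    }
```

Note that a deserialization or validation error (permanent) skips the retry loop entirely, while a timeout only lands in the DLQ after the retry budget is spent; the attached `error_class` and `attempts` fields are what make the later human-supervised replay decision possible.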
4 / 5
The interviewer asks: "How do you test an event-driven system where events are processed asynchronously?" Choose the answer that demonstrates the fullest testing strategy.
Option B is strongest: it describes four distinct test layers with a precise scope for each, explicitly names contract testing (critical for schema governance), states the "do not test Kafka itself" principle, and calls out Thread.sleep() as an anti-pattern. The Awaitility-style polling approach is the correct way to write deterministic async tests. Option D is partially correct but accepts flakiness as inevitable; with proper polling and test isolation, it is not. Option C is functional but uses a 5-second sleep, which makes tests slow and fragile. Option A avoids C's sleep problem but is otherwise similar, and it omits contract testing, which is critical in EDA to catch schema incompatibility before it surfaces at runtime. Key structure: unit (no infra) → integration (Testcontainers) → contract (schema registry CI gate) → E2E polling (no sleep).
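The Awaitility-style pattern named in Option B replaces a fixed sleep with a condition polled until a deadline: the test returns as soon as the asynchronous effect is visible and fails loudly if it never is. A minimal Python sketch, assuming nothing beyond the standard library; the helper name `await_until` is illustrative, not a real library API.

```python
# Sketch of an Awaitility-style assertion: poll a condition until it holds
# or a deadline passes, instead of sleeping a fixed amount. Deterministic
# and fast: returns the moment the async effect becomes visible.
import time

def await_until(condition, timeout_s=5.0, poll_interval_s=0.05):
    """Poll until condition() is truthy; raise TimeoutError on deadline."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_interval_s)
    raise TimeoutError(f"condition not met within {timeout_s}s")

# Usage in a test: publish an event, then poll the read side.
processed = []
processed.append("order-42")                    # stands in for the consumer's side effect
await_until(lambda: "order-42" in processed)    # no fixed sleep, no flakiness
```

Compared with a 5-second sleep, this costs at most one poll interval on the happy path and gives a clear timeout failure on the sad path, which is what makes the E2E layer deterministic.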
5 / 5
The interviewer asks: "An event was processed twice by a consumer and caused a duplicate payment. How would you prevent this in the future?" Which answer demonstrates the most complete design thinking?
Option B is the strongest: it names the root cause (at-least-once delivery), identifies the idempotency key source (the event ID from the envelope), specifies the check-and-write atomicity requirement, gives a TTL range with justification, and extends the responsibility to the upstream payment provider (true defense in depth). Option D is not wrong — a unique constraint is a valid idempotency mechanism — but it relies on the error path (catching exceptions), which is fragile and doesn't address the upstream provider layer. Option C conflates Kafka's producer idempotency and transactions (which prevent duplicates within Kafka read-process-write pipelines) with consumer-side idempotency: Kafka's exactly-once semantics do not cover external side effects such as calling a payment provider, so a consumer that crashes after the call but before committing its offset will still reprocess. Option A is correct in concept but omits atomicity (the check-and-write race condition). Key tip: always use an explicit idempotency key, make the check + write atomic, set an appropriate TTL, and remember the upstream dependency.
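The consumer-side design from Option B can be sketched as follows: the envelope's event ID is the idempotency key, and the check and the write of that key happen atomically so two concurrent deliveries cannot both pass the check. In production the atomic step would be a conditional store write (a Redis SETNX with TTL, or a relational INSERT ... ON CONFLICT); here a lock-guarded in-memory dict stands in, and all names are illustrative.

```python
# Sketch of an idempotent payment consumer: the event ID is the idempotency
# key; check + write are atomic under a lock so a concurrent redelivery
# cannot slip through; keys expire after a TTL sized to the replay window.
# In production the store would be Redis SETNX or a DB conditional insert.
import threading
import time

class IdempotentConsumer:
    def __init__(self, ttl_s=7 * 24 * 3600):     # TTL covers the broker's replay window
        self._seen = {}                          # key -> expiry timestamp
        self._lock = threading.Lock()
        self._ttl = ttl_s

    def process(self, event):
        key = event["event_id"]                  # idempotency key from the envelope
        now = time.time()
        with self._lock:                         # check-and-write is atomic
            expiry = self._seen.get(key)
            if expiry is not None and expiry > now:
                return "duplicate-skipped"
            self._seen[key] = now + self._ttl
        # ... charge_payment(event) would run here, at most once per key
        return "processed"

consumer = IdempotentConsumer()
first = consumer.process({"event_id": "evt-1", "amount": 100})
second = consumer.process({"event_id": "evt-1", "amount": 100})  # redelivery
```

Doing the check and the write as one guarded step is precisely the race the answer calls out in Option A; and, per the defense-in-depth point, the same key should also be passed to the upstream payment provider's own idempotency mechanism.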