5 exercises — traceId and distributed tracing, connection pool exhaustion, degraded health checks, circuit breakers, and rate limiting. Read log evidence and communicate findings clearly.
JSON log field reference
timestamp — when the event happened (ISO 8601 / UTC)
A log entry reads: {"timestamp":"2026-04-07T03:14:22.441Z","level":"ERROR","service":"payment-api","traceId":"9f2c1d8e","userId":"usr_8821","msg":"charge failed","error":"card_declined","duration_ms":312}
A colleague asks: "Which field should I use to find all other log entries from the same request across multiple services?" What is your answer?
traceId (also called correlation ID or request ID) is the key field for distributed tracing.
What traceId means: When a single user action (e.g., a payment) touches multiple services (payment-api → fraud-service → bank-gateway → notification-service), a single traceId is injected at the first service and passed to every downstream service. This allows you to reconstruct the complete request path across all services.
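To make the propagation concrete, here is a minimal sketch of how a service reuses an incoming traceId or starts a new one, then attaches it to every downstream call. The header name X-Trace-Id and the helper functions are illustrative assumptions, not a specific framework's API; real services usually do this in HTTP middleware or via an OpenTelemetry SDK (which uses the W3C "traceparent" header).

```python
import uuid

# Hypothetical header name; real tracing stacks standardize this for you.
TRACE_HEADER = "X-Trace-Id"

def ensure_trace_id(incoming_headers: dict) -> str:
    """Reuse the caller's traceId if present, otherwise start a new trace."""
    return incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex[:8]

def call_downstream(url: str, trace_id: str) -> dict:
    """Every outbound call carries the same traceId so its logs can be correlated."""
    return {"url": url, "headers": {TRACE_HEADER: trace_id}}

# payment-api receives an external request with no trace header, so it starts
# the trace; fraud-service, bank-gateway, etc. would reuse the incoming value.
trace_id = ensure_trace_id({})
print(call_downstream("https://fraud-service.internal/check", trace_id))
```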
Common names for this field: • traceId, trace_id • requestId, request_id • correlationId, X-Correlation-ID • Related but narrower: spanId identifies one sub-operation within a trace (OpenTelemetry); search by traceId, not spanId, to see the whole request
In Kibana/Splunk/Grafana Loki: traceId: "9f2c1d8e" → shows all log lines across all services for this request
Other fields explained: • service — identifies which service emitted the log • userId — identifies the user (may span many requests) • duration_ms — how long the operation took • error — machine-readable error code (vs msg which is human-readable)
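The search side, sketched in Python: filter newline-delimited JSON logs by traceId and order the matches by timestamp. The file name app.log is a hypothetical local export; Kibana, Splunk, and Loki run the same filter server-side.

```python
import json

def lines_for_trace(path: str, trace_id: str) -> list[dict]:
    """Collect every log entry in a newline-delimited JSON file for one trace."""
    matches = []
    with open(path, encoding="utf-8") as f:
        for raw in f:
            try:
                entry = json.loads(raw)
            except json.JSONDecodeError:
                continue  # skip non-JSON lines (stack traces, startup banners)
            if entry.get("traceId") == trace_id:
                matches.append(entry)
    # ISO 8601 UTC timestamps sort correctly as strings, so this orders the request's steps.
    return sorted(matches, key=lambda e: e.get("timestamp", ""))

for entry in lines_for_trace("app.log", "9f2c1d8e"):
    print(entry["timestamp"], entry.get("service"), entry.get("msg"))
```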
Exercise 2 / 5
You see these two log entries from the same service within 2 seconds: {"level":"WARN","msg":"database connection pool exhausted, waiting for connection","pool_size":10,"waiting":8} {"level":"ERROR","msg":"database query timeout after 30000ms","query":"SELECT * FROM orders WHERE...","timeout_ms":30000}
What is the correct interpretation of what these two log lines are telling you together?
These two log lines together tell a causal story — read them in sequence:
Line 1 (WARN): pool exhausted, waiting: 8 → All 10 database connections are occupied. 8 requests are queued waiting for a free connection. → This is a warning, not yet an error — the system is degraded but still functioning.
Line 2 (ERROR): query timeout after 30000ms → A request waited so long for a connection (or the query itself ran long) that it hit the 30-second timeout. → This is now an error — requests are failing.
Possible root causes to investigate: 1. Slow queries — queries holding connections too long, blocking the pool 2. Connection leak — connections opened but never returned to the pool 3. Traffic spike — more concurrent requests than the pool supports 4. Pool size too small for current load
Key log reading vocabulary: • pool_exhausted / waiting — resource saturation signal • timeout after Nms — request did not complete within the allowed time • connection leak — connections not returned to pool • pool_size — maximum concurrent database connections configured
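A minimal sketch of how this failure mode arises, assuming a thread-per-request service with a bounded pool: when every connection is busy, new requests queue (the WARN), and after 30 seconds of waiting they give up (the ERROR). The pool and query here are simulated; only the two numbers come from the logs above.

```python
import threading
import time

POOL_SIZE = 10             # pool_size: 10 from the WARN line
ACQUIRE_TIMEOUT_S = 30.0   # timeout_ms: 30000 from the ERROR line

pool = threading.BoundedSemaphore(POOL_SIZE)

def run_query(query: str) -> None:
    # If all 10 connections are busy, this call blocks: the queued state
    # that the WARN line ("pool exhausted, waiting") is reporting.
    if not pool.acquire(timeout=ACQUIRE_TIMEOUT_S):
        # After 30s of waiting the request gives up: the ERROR line.
        raise TimeoutError("database query timeout after 30000ms")
    try:
        time.sleep(0.5)  # stand-in for the query; a slow query here starves everyone else
    finally:
        pool.release()   # skipping this release is the classic "connection leak"
```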
Exercise 3 / 5
A log entry shows: {"level":"INFO","msg":"health check","status":"ok","checks":{"db":"ok","redis":"ok","queue":"degraded"},"duration_ms":45}
What action, if any, does this log entry require?
Log levels are not the whole picture — always read the content too.
This is an important trap: the level is INFO, which may seem benign, but the content shows a degraded dependency. The required action is to investigate the queue component now, even though nothing here is logged at WARN or ERROR.
Understanding status: degraded in health checks: • ok — fully healthy • degraded — functioning but with reduced capacity or elevated error rate • failure / error — not functioning
A "degraded" queue might mean: • Consumer lag is increasing (messages not being processed fast enough) • The queue is approaching its size limit • Some queue workers are down • Connection retries are occurring silently
Why this was logged at INFO, not WARN or ERROR: The health check itself completed successfully (it checked and returned a result). The health check service often logs at INFO regardless of component status, relying on downstream alerting to trigger on degraded values.
Standard health check log vocabulary: • status: ok — all components healthy • status: degraded — service is up but one or more components are impaired • status: unavailable / down — service is not serving traffic • checks: { component: "degraded" } — per-dependency health detail
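A minimal sketch of an alert rule that keys on that per-dependency detail rather than the log level. The parsing is plain JSON handling; where the alert actually goes (page, ticket, Slack) is an assumption, stubbed here as a print.

```python
import json

line = ('{"level":"INFO","msg":"health check","status":"ok",'
        '"checks":{"db":"ok","redis":"ok","queue":"degraded"},"duration_ms":45}')

entry = json.loads(line)

# Key on the content, not the level: collect every component that is not "ok".
unhealthy = {name: state for name, state in entry.get("checks", {}).items() if state != "ok"}

if unhealthy:
    # The level was INFO, but this is where a page or ticket would be raised.
    print(f"ALERT: degraded or failing components: {unhealthy}")  # {'queue': 'degraded'}
```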
Exercise 4 / 5
During an incident, you find this log line: {"level":"ERROR","msg":"upstream service unavailable","upstream":"inventory-service","attempts":3,"last_error":"connection refused (ECONNREFUSED)","circuit_breaker":"open"}
What does "circuit_breaker: open" mean in this context?
A circuit breaker is a resiliency pattern that stops calling a failing upstream service to prevent the failure from spreading.
Circuit breaker states: • Closed (normal) — requests pass through; failures are counted • Open — threshold of failures exceeded; ALL requests to this upstream are immediately rejected (no attempt made) for a timeout period • Half-open — timeout expired; allows a few test requests through to see if the upstream has recovered
Why "open" is actually a protective measure, not a problem: Without a circuit breaker: 1. inventory-service returns ECONNREFUSED 2. Every request tries to call it and waits for a timeout 3. Thread pool fills up waiting for timeouts 4. Your service becomes slow or unresponsive (cascade failure)
With an open circuit breaker: 1. Requests that need inventory-service fail fast with a clear error 2. Other functionality continues working 3. Your service logs clearly what is unavailable
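A minimal sketch of the three states in Python. The threshold, the open period, and the class itself are illustrative assumptions, not any particular library's API; in production this logic usually comes from a resilience library or a service mesh.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, open_seconds: float = 30.0):
        self.failure_threshold = failure_threshold  # failures before tripping (illustrative)
        self.open_seconds = open_seconds            # how long to stay open (illustrative)
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, upstream_call):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.open_seconds:
                # Open: fail fast, no network attempt, no thread stuck waiting on a timeout.
                raise RuntimeError("circuit_breaker: open, upstream unavailable")
            # Half-open: the open period expired, so let one test request through.
        try:
            result = upstream_call()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker: closed -> open
            raise
        # Success: reset everything, back to closed.
        self.failures = 0
        self.opened_at = None
        return result
```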
ECONNREFUSED meaning: The operating system returned "Connection Refused" — the target host is reachable, but nothing is listening on that port. This usually means the upstream service process has crashed or is restarting.
Key vocabulary: • upstream — a service that this service calls • attempts: 3 — tried 3 times before failing • circuit_breaker: open — requests blocked until upstream recovers • ECONNREFUSED — OS-level: host reachable, but nothing listening on the port
Exercise 5 / 5
You are investigating a spike in errors. You find this log entry: {"level":"WARN","msg":"rate limit applied","client_ip":"203.0.113.42","endpoint":"/api/search","requests_last_minute":847,"limit":100,"action":"throttled","retry_after_ms":24000}
Write a one-sentence Slack incident update based only on the information in this log line. Which of the following is the best update?
A good incident update translates technical log data into a clear, factual, actionable statement.
Why option B is best: 1. States the specific fact: one client IP, specific endpoint, specific rate (847 vs 100 limit) 2. Describes the system response: throttled, retry in 24s (the system is working as designed) 3. Acknowledges uncertainty: "Investigating whether..." — correctly notes this is being analyzed, not yet resolved 4. Lists possible root causes: three plausible explanations without guessing
What makes the other options poor: • A — vague: "something is wrong" provides no actionable information • C — misinterpretation: the rate limiter is working correctly; calling it "broken" is wrong • D — wrong action: no evidence that a restart would help
Key log fields to extract for incident updates: • client_ip → WHO is causing the issue • endpoint → WHAT is being affected • requests_last_minute: 847 vs limit: 100 → HOW SEVERE • action: throttled → SYSTEM RESPONSE (is it being handled?) • retry_after_ms: 24000 → RECOVERY TIME
Incident update vocabulary: "A client is exceeding rate limits on [endpoint] — throttling has been applied." "Investigating whether this is [X], [Y], or [Z]." "The system is handling this via [mechanism]; no service disruption at this time."
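As a worked example, a small sketch that pulls those fields out of the log line and assembles an update in the shape above. The [X], [Y], [Z] placeholders stay as placeholders because the log line alone does not tell you the root cause; only the message construction is shown, not the Slack posting.

```python
import json

# The log line from the exercise, as newline-delimited JSON would deliver it.
line = ('{"level":"WARN","msg":"rate limit applied","client_ip":"203.0.113.42",'
        '"endpoint":"/api/search","requests_last_minute":847,"limit":100,'
        '"action":"throttled","retry_after_ms":24000}')

e = json.loads(line)

# WHO / WHAT / HOW SEVERE / SYSTEM RESPONSE / RECOVERY TIME, in one sentence.
update = (
    f"Client {e['client_ip']} is sending {e['requests_last_minute']} requests/min "
    f"to {e['endpoint']} (limit: {e['limit']}) and is being {e['action']}, "
    f"retry after {e['retry_after_ms'] // 1000}s; investigating whether this is "
    "[X], [Y], or [Z]; no service disruption at this time."
)
print(update)
```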