Question 1

What is a metric in observability?

Accepted Answer

A numeric measurement sampled over time — request rate, CPU, queue depth. Cheap to store and aggregate; great for dashboards and alerts.

Question 2

What is a log in observability?

Accepted Answer

A timestamped record of a discrete event. High detail, high volume — the thing you grep when an alert fires.

Question 3

What is a trace in observability?

Accepted Answer

The end-to-end story of one request as it crosses services, made of nested spans. It answers the question "where did the time go?"

Question 4

What is a span in observability?

Accepted Answer

A single timed unit of work within a trace — one operation, with a start, duration, parent, and attributes.
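
As a rough illustration, a span can be modelled as a small record and a trace as a list of spans linked by parent IDs. Everything in this sketch is hypothetical (the `Span` class, its field names, and `self_times` are not any tracing library's API), and it assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    name: str
    start_ms: float
    duration_ms: float
    attributes: dict = field(default_factory=dict)


def self_times(spans):
    """Time spent in each span itself, excluding its direct children."""
    own = {s.span_id: s.duration_ms for s in spans}
    for s in spans:
        if s.parent_id in own:
            own[s.parent_id] -= s.duration_ms
    names = {s.span_id: s.name for s in spans}
    return {names[span_id]: ms for span_id, ms in own.items()}


trace = [
    Span("a1", None, "GET /checkout", 0, 120),
    Span("b2", "a1", "auth", 5, 20),
    Span("c3", "a1", "db.query", 30, 80, {"db.system": "postgresql"}),
]
breakdown = self_times(trace)
```

Here most of the 120 ms request is spent in the `db.query` span, which is the kind of answer a trace view is meant to surface.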

Question 5

What is structured logging in observability?

Accepted Answer

Logging machine-parseable key/value records (usually JSON) instead of free-form text, so logs can be queried and filtered. Include a trace_id field so logs link straight to the trace.
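
A minimal sketch of the idea using only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the logger name, and the field names are illustrative assumptions; real services often use a dedicated structured-logging library instead.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # trace_id arrives via `extra=`; it is absent on plain log calls
            "trace_id": getattr(record, "trace_id", None),
        })


def build_logger(stream):
    logger = logging.getLogger("checkout")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()
log = build_logger(stream)
log.info("payment declined", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
event = json.loads(stream.getvalue())
```

Because every event is a JSON object, a log backend can filter on `trace_id` directly instead of pattern-matching free text.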

Question 6

What is an SLI in observability?

Accepted Answer

Service Level Indicator — the actual measured number you care about, e.g. the proportion of requests served under 300ms.

Question 7

What is an SLO in observability?

Accepted Answer

Service Level Objective — the internal target for an SLI over a window, e.g. 99.9% of requests under 300ms over 30 days.
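
To make the SLI/SLO relationship concrete, here is a small sketch that measures a latency SLI from raw request durations and compares it with a target. The function names and the sample data are hypothetical; in practice the SLI is usually computed by the metrics backend over the full window.

```python
def latency_sli(latencies_ms, threshold_ms=300):
    """Proportion of requests served under the threshold (the SLI)."""
    if not latencies_ms:
        raise ValueError("need at least one request")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli, target=0.999):
    """True when the measured SLI is at or above the SLO target."""
    return sli >= target


sample = [120, 180, 250, 450]  # made-up request durations in milliseconds
sli = latency_sli(sample)
ok = meets_slo(sli)
```

With one of four requests over 300 ms, the SLI is 0.75, far below a 99.9% objective.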

Question 8

What is an SLA in observability?

Accepted Answer

Service Level Agreement — a contractual promise to customers, with penalties. Usually looser than your internal SLO.

Question 9

What is an error budget in observability?

Accepted Answer

The allowed amount of failure: 100% minus the SLO. A 99.9% SLO grants ~43 minutes of downtime per 30 days to spend on risk. When the budget is spent, you freeze risky deploys and focus on reliability.
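
The arithmetic behind the ~43 minute figure can be sketched as follows. The helper names are hypothetical, and the calculation treats the budget purely as downtime minutes, which is a simplification for request-based SLOs.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes: (100% - SLO) of the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative if overspent)."""
    return 1.0 - downtime_minutes / error_budget_minutes(slo, window_days)


budget = error_budget_minutes(0.999)       # about 43.2 minutes per 30 days
remaining = budget_remaining(0.999, 21.6)  # about half the budget left
```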

Question 10

What is burn rate in observability?

Accepted Answer

How fast you’re consuming the error budget relative to "on pace". A burn rate of 1 spends it exactly over the window; higher means trouble. Multi-window, multi-burn-rate alerts catch both fast outages and slow leaks.
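
A sketch of the calculation and of a two-window check. The function names are hypothetical, and the 14.4 threshold is a commonly cited paging value for a 30-day SLO evaluated over a 1-hour and a 5-minute window; treat it as an assumption to tune, not a rule.

```python
def burn_rate(error_ratio, slo):
    """Observed error ratio divided by the ratio the SLO allows."""
    return error_ratio / (1.0 - slo)


def should_page(long_window_errors, short_window_errors, slo, threshold=14.4):
    """Page only when both the long and the short window burn fast.

    The long window shows the problem is significant; the short window
    shows it is still happening right now.
    """
    return (burn_rate(long_window_errors, slo) >= threshold
            and burn_rate(short_window_errors, slo) >= threshold)


on_pace = burn_rate(0.001, 0.999)             # roughly 1: budget lasts the window
outage = should_page(0.02, 0.02, 0.999)       # burning about 20x in both windows
recovered = should_page(0.02, 0.0005, 0.999)  # short window is healthy again
```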

Question 11

What are the golden signals in observability?

Accepted Answer

Google SRE’s four signals to watch on any user-facing service: latency, traffic, errors, and saturation.

Question 12

What is MTTR in observability?

Accepted Answer

Mean Time To Recover/Repair: the average time from an incident starting to service being restored. DORA tracks a close relative, time to restore service, as one of its four key metrics.

Question 13

What is MTTD in observability?

Accepted Answer

Mean Time To Detect — average time between a problem starting and someone (or an alert) noticing it.

Question 14

What is MTTA in observability?

Accepted Answer

Mean Time To Acknowledge — average time from an alert firing to an on-call engineer acknowledging it.

Question 15

What is MTBF in observability?

Accepted Answer

Mean Time Between Failures — average healthy interval between incidents. Higher is better.
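
The four incident metrics can all be derived from the same incident timeline. The sketch below is hypothetical: the `Incident` record and its fields are assumptions, times are plain minutes on a shared clock, and the detection time is treated as the moment the alert fired.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    started: float       # problem begins
    detected: float      # alert fires (assumed equal to detection)
    acknowledged: float  # on-call engineer acknowledges
    resolved: float      # service restored


def incident_metrics(incidents):
    ordered = sorted(incidents, key=lambda i: i.started)
    # Healthy gap between one incident ending and the next one starting.
    gaps = [nxt.started - cur.resolved for cur, nxt in zip(ordered, ordered[1:])]
    return {
        "mttd": mean(i.detected - i.started for i in ordered),
        "mtta": mean(i.acknowledged - i.detected for i in ordered),
        "mttr": mean(i.resolved - i.started for i in ordered),
        "mtbf": mean(gaps) if gaps else None,
    }


metrics = incident_metrics([
    Incident(started=0, detected=5, acknowledged=8, resolved=35),
    Incident(started=1000, detected=1002, acknowledged=1012, resolved=1025),
])
```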

Question 16

What is uptime / availability (nines) in observability?

Accepted Answer

The fraction of time a service is up, quoted in "nines" (99.9% is "three nines"). Each extra nine cuts the allowed downtime tenfold: 99.9% allows about 8.8 hours a year, 99.99% about 53 minutes.

Question 17

What is cardinality in observability?

Accepted Answer

The number of unique label combinations a metric has. High cardinality (e.g. user_id as a label) blows up storage and cost. Keep IDs and free-form values out of metric labels; put them in traces or logs.
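
A back-of-the-envelope sketch of why one ID-like label is enough to cause trouble: the worst-case number of time series is the product of the distinct values of each label. The label names and counts below are made up for illustration.

```python
from math import prod


def worst_case_series(distinct_values_per_label):
    """Upper bound on time series for one metric: product of label value counts."""
    return prod(distinct_values_per_label.values())


bounded = {"method": 5, "status": 10, "route": 50}
with_user_id = {**bounded, "user_id": 100_000}

series_bounded = worst_case_series(bounded)
series_with_user_id = worst_case_series(with_user_id)
```

The bounded labels top out at 2,500 series, while adding `user_id` raises the ceiling to 250 million. Real counts are lower because not every combination occurs, but the growth is multiplicative either way.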

Question 18

What is a label / tag in observability?

Accepted Answer

A key/value dimension attached to a metric (Prometheus calls it a "label"; Datadog/StatsD call it a "tag") so you can slice and filter.
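
For a feel of what labels look like on the wire, this sketch renders one sample in the style of the Prometheus text exposition format. `render_sample` is a hypothetical helper written for this example; it covers only the simple case of a single sample line with string label values.

```python
def render_sample(name, labels, value):
    """Format one sample as `name{key="value",...} value`."""

    def escape(text):
        # Backslash, double quote and newline are escaped in label values.
        return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    pairs = ",".join(f'{key}="{escape(str(val))}"' for key, val in sorted(labels.items()))
    return f"{name}{{{pairs}}} {value}"


line = render_sample("http_requests_total", {"status": "500", "method": "GET"}, 3)
```

Each distinct combination of label values is its own time series, which is what lets a query slice by `method` or filter on `status`.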

Question 19

What is sampling in observability?

Accepted Answer

Keeping only a fraction of traces/logs to control cost. Head sampling decides up front; tail sampling decides after seeing the whole trace.
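
The difference between the two strategies can be sketched in a few lines. Both functions are hypothetical: the head sampler assumes trace IDs are random hex strings so their low bits can act as a stable coin flip, and the tail sampler's "keep errors and slow traces" policy is just one possible rule.

```python
def head_sample(trace_id, rate):
    """Decide up front from the trace ID alone.

    Every service that sees the same trace ID reaches the same decision.
    """
    bucket = int(trace_id[-8:], 16) / 2**32  # maps the low 32 bits into [0, 1)
    return bucket < rate


def tail_sample(spans, slow_ms=1000):
    """Decide after the whole trace is collected: keep errors and slow traces."""
    has_error = any(span["error"] for span in spans)
    is_slow = max(span["duration_ms"] for span in spans) >= slow_ms
    return has_error or is_slow


kept_early = head_sample("4bf92f3577b34da6a3ce929d00000000", 0.1)
dropped_early = head_sample("4bf92f3577b34da6a3ce929dffffffff", 0.1)
kept_late = tail_sample([
    {"error": False, "duration_ms": 40},
    {"error": True, "duration_ms": 12},
])
```

Head sampling is cheap but blind to outcomes; tail sampling can keep the interesting traces but has to buffer every span until the trace completes.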

Question 20

What is a percentile (p50/p95/p99) in observability?

Accepted Answer

The value below which that share of measurements fall. p99 latency = 99% of requests were faster than this. Averages hide the tail. You can’t average percentiles across instances — aggregate from a histogram instead.

Question 21

What is a histogram in observability?

Accepted Answer

A metric that buckets observations (e.g. request durations) so you can compute percentiles and heatmaps server-side.
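
The sketch below estimates a percentile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank, similar in spirit to what Prometheus's `histogram_quantile` does. The bucket bounds, the counts, and the function names are made up, and the code expects `0 < q <= 1`.

```python
BOUNDS_MS = [100, 200, 400, 800]  # upper bound of each bucket


def merge(*histograms):
    """Combine instances by summing their cumulative counts bucket by bucket."""
    return [sum(counts) for counts in zip(*histograms)]


def quantile(q, cumulative, bounds=BOUNDS_MS):
    """Estimate the q-quantile from cumulative bucket counts."""
    rank = q * cumulative[-1]
    lower, below = 0.0, 0
    for upper, count in zip(bounds, cumulative):
        if count >= rank and count > below:
            # Assume observations are spread evenly inside the bucket.
            return lower + (upper - lower) * (rank - below) / (count - below)
        lower, below = upper, count
    return float(bounds[-1])


instance_a = [90, 99, 100, 100]  # mostly fast requests
instance_b = [10, 50, 90, 100]   # mostly slow requests

p99_merged = quantile(0.99, merge(instance_a, instance_b))
p99_averaged = (quantile(0.99, instance_a) + quantile(0.99, instance_b)) / 2
```

With these numbers the merged estimate is about 720 ms while the average of the two per-instance p99s is about 480 ms, which shows why percentiles should be computed from merged buckets rather than averaged.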

Question 22

What is a gauge in observability?

Accepted Answer

A metric that can go up and down — a snapshot of a value right now (temperature, queue length, memory in use).

Question 23

What is a counter in observability?

Accepted Answer

A metric that only ever increases, apart from resetting to zero when the process restarts. In PromQL you take its rate() to get a per-second value.
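
A simplified sketch of how a per-second rate can be derived from raw counter samples while tolerating a reset. This is not how Prometheus implements `rate()` (which also extrapolates to the window edges); `counter_rate` and the sample data are assumptions for illustration.

```python
def counter_rate(samples):
    """Per-second increase over (timestamp_seconds, value) samples sorted by time.

    A drop in value is treated as a restart: the counter began again from
    zero, so the new value itself is the increase since the reset.
    """
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        increase += current - previous if current >= previous else current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# The counter restarts between t=15 and t=30.
samples = [(0, 100), (15, 130), (30, 10), (45, 40)]
rate = counter_rate(samples)  # (30 + 10 + 30) / 45 seconds
```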

Question 24

What is instrumentation in observability?

Accepted Answer

The code that emits metrics, logs and traces. Can be manual (you add it) or automatic (an agent/library injects it).

Question 25

What is context propagation in observability?

Accepted Answer

Passing trace identifiers (trace_id, span_id) across service boundaries — usually in HTTP headers — so spans link into one trace.
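
One widely used header for this is W3C Trace Context's `traceparent`, laid out as `version-trace_id-parent_id-flags`. The sketch below parses an incoming header and builds the one to send downstream; the helper names are hypothetical, and a real service would normally let its tracing SDK handle this.

```python
import secrets


def parse_traceparent(header):
    """Split a traceparent header into its trace ID, parent span ID and sampled flag."""
    version, trace_id, parent_id, flags = header.split("-")
    if len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError("malformed traceparent header")
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def outgoing_headers(incoming):
    """Keep the trace ID, but name this service's own span as the new parent."""
    ctx = parse_traceparent(incoming["traceparent"])
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if ctx["sampled"] else "00"
    return {"traceparent": f"00-{ctx['trace_id']}-{span_id}-{flags}"}


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
downstream = parse_traceparent(outgoing_headers(incoming)["traceparent"])
```

The trace ID stays constant across every hop, while the parent ID changes at each service, which is what lets the backend stitch the spans into one tree.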

Question 26

What is the RED method in observability?

Accepted Answer

For request-driven services, watch Rate, Errors, and Duration. A simpler, service-focused cousin of the golden signals.

Question 27

What is the USE method in observability?

Accepted Answer

For resources (CPU, disk, network), watch Utilisation, Saturation, and Errors. Brendan Gregg’s checklist for finding bottlenecks.

Question 28

What is Prometheus in observability?

Accepted Answer

The de-facto open-source metrics database. Pulls (scrapes) metrics from targets and stores time series; queried with PromQL.

Question 29

What is PromQL in observability?

Accepted Answer

Prometheus Query Language — selects and aggregates time series for dashboards and alerts.

Question 30

What is Grafana in observability?

Accepted Answer

The dashboarding and visualisation layer. Plugs into Prometheus, Loki, and many other data sources; also handles alerting.

Question 31

What is OpenTelemetry (OTEL) in observability?

Accepted Answer

The vendor-neutral standard (and SDKs) for generating and exporting metrics, logs and traces. The Collector reshapes and ships them anywhere.

Question 32

What is Jaeger in observability?

Accepted Answer

An open-source distributed tracing backend — stores and visualises traces so you can see span timing across services.

Question 33

What is Loki in observability?

Accepted Answer

Grafana’s log database — indexes only labels, not full text, making it cheap. Queried with LogQL.

Question 34

What is Datadog in observability?

Accepted Answer

A commercial all-in-one observability SaaS — metrics, logs, traces (APM), and dashboards behind one agent and UI. Convenient but billed largely on custom metrics and ingested log/trace volume — watch cardinality.

Sections

The three pillars

metric

log

trace

span

structured logging

Reliability & SLOs

SLI

SLO

SLA

error budget

burn rate

golden signals

Incident metrics

MTTR

MTTD

MTTA

MTBF

uptime / availability (nines)

Core concepts

cardinality

label / tag

sampling

percentile (p50/p95/p99)

histogram

gauge

counter

instrumentation

context propagation

RED method

USE method

Tooling

Prometheus

PromQL

Grafana

OpenTelemetry (OTEL)

Jaeger

Loki

Datadog

English phrases engineers use
