English for Datadog Observability
Learn the English vocabulary for working with Datadog: monitors, dashboards, tags, SLOs, and the terms for discussing observability with your team.
Datadog unifies metrics, logs, traces, and synthetic checks on one platform. The terms teams use to talk about it, such as monitors, tags, and SLOs, carry specific meanings that are worth getting right, especially when an alert fires at 3 a.m. and precision saves time.
Key Vocabulary
Monitor — a configured rule that evaluates a metric, log query, or trace condition against a threshold and triggers a notification when that threshold is breached. It is Datadog’s term for what other tools call an alert rule. “We set up a monitor on p99 latency that pages on-call if it stays above 500ms for five consecutive minutes.”
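The condition in that example sentence can be sketched in a few lines of plain Python. This is only an illustration of the idea "above a threshold for five consecutive minutes"; the function name, the one-sample-per-minute assumption, and the sample values are hypothetical, and this is not how Datadog evaluates monitors internally.

```python
from typing import Sequence


def monitor_breached(
    samples_ms: Sequence[float],
    threshold_ms: float = 500.0,
    window: int = 5,
) -> bool:
    """Return True if the last `window` samples all exceed the threshold.

    Assumes one p99 latency sample per minute (a hypothetical setup).
    """
    if len(samples_ms) < window:
        # Not enough data to cover the full evaluation window yet.
        return False
    return all(sample > threshold_ms for sample in samples_ms[-window:])


# One made-up p99 latency value per minute, oldest first.
p99_by_minute = [420.0, 510.0, 530.0, 560.0, 545.0, 590.0]
print(monitor_breached(p99_by_minute))  # True: the last five are all above 500
```

A single dip below the threshold inside the window resets the condition, which is why a longer evaluation window makes a monitor less noisy.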
Tag — a key-value label (like env:production or service:checkout) attached to metrics, logs, and traces that lets you filter, group, and correlate data across all three signal types consistently. “Every service ships with a team tag now, so we can filter the entire dashboard down to just our team’s services instead of scrolling through everyone else’s.”
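To see why consistent tags matter, here is a minimal sketch in which one tag filter is applied unchanged to metrics, logs, and traces. The records, tag values, and helper names are all made up for illustration; real Datadog data is queried through the product, not through Python dictionaries.

```python
def has_tags(record: dict, **wanted: str) -> bool:
    """True if the record carries every requested key-value tag."""
    tags = record.get("tags", {})
    return all(tags.get(key) == value for key, value in wanted.items())


def filter_by_tags(records: list, **wanted: str) -> list:
    return [record for record in records if has_tags(record, **wanted)]


# Hypothetical sample data: three signal types sharing the same tag keys.
signals = {
    "metrics": [
        {"name": "request.latency",
         "tags": {"env": "production", "service": "checkout", "team": "payments"}},
        {"name": "request.latency",
         "tags": {"env": "production", "service": "search", "team": "discovery"}},
    ],
    "logs": [
        {"message": "payment declined",
         "tags": {"env": "production", "service": "checkout", "team": "payments"}},
        {"message": "cache miss",
         "tags": {"env": "staging", "service": "checkout", "team": "payments"}},
    ],
    "traces": [
        {"span": "POST /pay",
         "tags": {"env": "production", "service": "checkout", "team": "payments"}},
    ],
}

# The same filter works for every signal type because the tags are consistent.
matches = {
    kind: filter_by_tags(records, team="payments", env="production")
    for kind, records in signals.items()
}
for kind, found in matches.items():
    print(kind, len(found))
```

If one service had used a different key (say, owner instead of team), it would silently drop out of every filtered view.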
SLO (Service Level Objective) — a target reliability threshold (like 99.9% of requests succeeding over 30 days) tracked against a defined SLI, with an error budget that shows how much unreliability is left before the target is breached. “Our checkout SLO is 99.9% success over a rolling 30 days — we’re currently burning error budget faster than the month’s pace allows, which is why we froze non-critical deploys.”
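The arithmetic behind an error budget is simple enough to work through. The sketch below assumes a request-based SLO and uses made-up request counts; the function name is hypothetical.

```python
def error_budget_remaining(target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative means breached)."""
    allowed_failures = (1 - target) * total
    if allowed_failures == 0:
        return 0.0
    return 1 - failed / allowed_failures


# A 99.9% target over 2,000,000 requests allows about 2,000 failures.
# With 1,200 failures so far, roughly 60% of the budget is spent.
remaining = error_budget_remaining(target=0.999, total=2_000_000, failed=1_200)
print(f"{remaining:.0%} of the error budget is left")  # 40%

# The same target expressed as time: 0.1% of a 30-day window.
allowed_downtime_minutes = (1 - 0.999) * 30 * 24 * 60
print(f"{allowed_downtime_minutes:.1f} minutes of allowed downtime")  # 43.2
```

Stating the budget in failures or minutes, rather than as a percentage, often makes the remaining margin easier to discuss in a review.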
Dashboard — a curated collection of widgets (timeseries graphs, query values, heatmaps) built to give an at-a-glance view of a service’s or team’s health, distinct from ad-hoc exploration in the metrics explorer. “The on-call dashboard shows error rate, latency, and saturation for every service in one view — it’s the first thing anyone opens when a page comes in.”
Log facets (faceted search) — structured attributes extracted from logs (status code, endpoint, user ID) that let you filter and pivot log data without writing a full-text search query. “Instead of grepping the raw log message, we filtered by the http.status_code facet directly — it’s indexed and much faster than a text search across millions of log lines.”
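The difference between a text search and a facet filter can be sketched with a small in-memory example. The log entries and the index structure are hypothetical and only illustrate the concept; they do not reflect how Datadog stores or indexes logs.

```python
from collections import defaultdict

# Hypothetical log entries: a free-text message plus a structured attribute.
logs = [
    {"message": "GET /cart completed in 500 ms", "http.status_code": 200},
    {"message": "POST /pay failed with status 500", "http.status_code": 500},
    {"message": "POST /pay upstream timeout", "http.status_code": 500},
]

# Full-text search: scans every message and matches any "500" it finds.
text_hits = [entry for entry in logs if "500" in entry["message"]]

# Facet filter: build an index on the attribute once, then look up by value.
facet_index = defaultdict(list)
for entry in logs:
    facet_index[entry["http.status_code"]].append(entry)
facet_hits = facet_index[500]

print([entry["message"] for entry in text_hits])
print([entry["message"] for entry in facet_hits])
```

The text search returns the successful request that merely took 500 ms and misses the timeout whose message never mentions the status code, while the facet lookup returns exactly the two failing requests.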
Common Phrases
- “Is this a monitor threshold breach, or just noisy data that needs a longer evaluation window?”
- “Are these services tagged consistently, or is that why the dashboard is missing some of them?”
- “How much error budget is left on this SLO before we need to freeze deploys?”
- “Is this dashboard curated for on-call, or is it more of an exploration view?”
- “Can we filter this by a log facet, or do we need a full-text search here?”
Example Sentences
Reporting an incident trigger: “The page came from a monitor on error rate exceeding 5% over a five-minute window — it correctly caught the regression about ninety seconds after the bad deploy went out.”
Explaining a tagging convention in onboarding: “Every service needs env, team, and service tags at minimum — without them, a service won’t show up correctly on the shared dashboards or in cross-team queries.”
Discussing SLO status in a review: “We’ve burned sixty percent of this month’s error budget already, mostly from the incident on the 15th — if we don’t slow down on risky deploys, we’ll breach the SLO before month end.”
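A statement like this one rests on a simple projection. The sketch below assumes the review happens on day 15 of a 30-day window and that the budget keeps burning at the same average pace; both are assumptions for illustration, and the function names are hypothetical.

```python
def burn_rate(budget_burned: float, elapsed_days: float, window_days: float) -> float:
    """Budget spent relative to time elapsed; above 1.0 means ahead of pace."""
    return budget_burned / (elapsed_days / window_days)


def projected_exhaustion_day(budget_burned: float, elapsed_days: float) -> float:
    """Day on which the budget runs out if the average pace continues."""
    return elapsed_days / budget_burned


# Assumed scenario: 60% of the budget burned by day 15 of a 30-day window.
rate = burn_rate(budget_burned=0.6, elapsed_days=15, window_days=30)
day = projected_exhaustion_day(budget_burned=0.6, elapsed_days=15)
print(f"burn rate: {rate:.1f}x, budget exhausted around day {day:.0f}")
```

A burn rate above 1.0 is the number behind the phrase "burning faster than the month's pace allows". A one-off incident inflates the average, so the projection is a worst case if the incident is not repeated.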
Professional Tips
- Name the specific monitor that triggered a page in incident reports — “an alert fired” without the monitor name forces the reader to go hunting for context that should already be in the report.
- Enforce tag consistency early and mention it in onboarding — inconsistent tagging is the single most common reason a dashboard silently excludes a service.
- Reference error budget remaining, not just the SLO target, when discussing release risk — a team with budget left can take more risk than a team that has already exhausted its budget.
- Distinguish a curated dashboard from ad-hoc exploration when pointing someone to a view — “check the dashboard” is more useful when it’s clear which one and why it’s the right one for the question.
Practice Exercise
- Write a sentence describing a monitor and the condition that triggers it.
- Explain what an SLO and error budget mean in your own words.
- Describe how tags help correlate data across metrics, logs, and traces.