Vocabulary for AI Observability and LLM Tracing
Master the advanced English vocabulary for AI observability — traces, spans, evals, token budgets, hallucination detection, and latency profiling for LLM-powered systems.
As LLM-powered applications move into production, the discipline of observability is adapting to meet them. Traditional metrics — CPU, memory, request latency — are necessary but insufficient for understanding how an AI system behaves. A new vocabulary has emerged around LLM tracing, evaluation, and monitoring. This guide covers that vocabulary at the depth needed for engineers building and operating production AI systems.
Why LLM Observability Is Different
In a conventional web service, a request goes in and a deterministic response comes out. If something is wrong, you look at the error log, the database query, or the slow function call. The system is auditable.
In an LLM-powered system, the computation is sampling from a probability distribution over tokens, so the same input can produce different outputs. The system can also be confidently wrong, a failure mode known as hallucination. There is no stack trace for a bad answer. Observability for LLMs therefore requires capturing the full context of a model call: the prompt, the model parameters, the output, the latency, and the downstream effect of that output.
Core Tracing Vocabulary
Trace
A trace is the complete record of a single request as it flows through a system. In distributed tracing (OpenTelemetry, Jaeger), a trace is made of spans. For LLM systems, a trace typically captures the entire chain from user input to final response, including any intermediate tool calls, retrieval steps, or model calls.
“The trace shows that the latency spike came from the embedding step, not the generation call.”
Span
A span is a single unit of work within a trace — one function call, one HTTP request, or one LLM call. Spans have a start time, end time, and a set of attributes (key-value metadata).
“Each LLM call gets its own span. I can see the token counts, the model ID, and the latency for every call in the chain.”
LLM Span Attributes
For LLM calls specifically, spans are often enriched with:
- model — the model name and version (e.g., claude-sonnet-4-5)
- input tokens — the number of tokens in the prompt
- output tokens — the number of tokens in the completion
- total tokens — combined count
- latency — time from request to first token (TTFT, time-to-first-token) and time to complete response
- finish reason — why the model stopped generating (stop, max_tokens, tool_use)
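As a rough illustration of how a span might carry the attributes listed above, here is a minimal sketch that uses only the Python standard library. The Trace and Span classes, the fake_llm_call helper, and the attribute key names are all hypothetical; real tooling such as OpenTelemetry defines its own APIs and naming conventions.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    """One unit of work inside a trace (hypothetical structure)."""
    name: str
    trace_id: str
    start: float = 0.0
    end: float = 0.0
    attributes: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000.0


class Trace:
    """Collects the spans that belong to a single request."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.spans = []

    @contextmanager
    def span(self, name: str, **attributes):
        s = Span(name=name, trace_id=self.trace_id, attributes=dict(attributes))
        s.start = time.perf_counter()
        try:
            yield s
        finally:
            # Record the end time even if the wrapped call raises.
            s.end = time.perf_counter()
            self.spans.append(s)


def fake_llm_call(prompt: str) -> dict:
    """Stand-in for a real model call; the usage numbers are made up."""
    return {
        "text": "ok",
        "input_tokens": len(prompt.split()),  # crude word count, not a real tokenizer
        "output_tokens": 1,
        "finish_reason": "stop",
    }


trace = Trace()
with trace.span("llm.generate", model="example-model") as s:
    result = fake_llm_call("Summarise the refund policy")
    s.attributes["input_tokens"] = result["input_tokens"]
    s.attributes["output_tokens"] = result["output_tokens"]
    s.attributes["total_tokens"] = result["input_tokens"] + result["output_tokens"]
    s.attributes["finish_reason"] = result["finish_reason"]
```

After the with block exits, trace.spans holds one span whose attributes can be exported or inspected alongside its duration_ms.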
Token Budget
A token budget is the maximum number of tokens allocated to a specific operation. Managing token budgets is critical for controlling cost and latency.
“We’re exceeding our token budget on the summarisation step. The retrieved context is too verbose — we need to compress it before passing it to the model.”
Key collocations:
- exhaust the token budget — use all allocated tokens
- trim the context — reduce the size of the prompt to fit within the budget
- truncate — cut off content that exceeds a limit
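The difference between trimming the context and truncating can be sketched as follows. This is only an illustration: count_tokens is a hypothetical stand-in that counts whitespace-separated words, whereas a real system would use the tokenizer that matches its model, and the chunk ordering is assumed to reflect relevance.

```python
def count_tokens(text: str) -> int:
    """Hypothetical tokenizer: approximates tokens as whitespace-separated words."""
    return len(text.split())


def trim_context(chunks: list, budget: int) -> list:
    """Keep whole chunks, in order, until the next one would exceed the budget."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept


def truncate(text: str, budget: int) -> str:
    """Cut a single text off at the budget, discarding everything after it."""
    return " ".join(text.split()[:budget])


retrieved = ["refunds take five days", "contact support by email", "unrelated marketing copy here"]
context = trim_context(retrieved, budget=8)
```

Trimming drops whole chunks so that nothing is cut mid-sentence, while truncation guarantees the limit at the cost of possibly losing the end of a passage.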
Evaluation Vocabulary
Evals (short for evaluations) are the LLM equivalent of unit tests. They assess the quality of model outputs, either automatically or with human review.
Types of Evals
- Offline eval — run against a fixed dataset before deployment; analogous to running tests in CI
- Online eval — run against live production traffic; samples real inputs and scores the outputs they produce
- LLM-as-judge — using another LLM to score the output of the primary model (“We use GPT-4o as a judge to rate our Claude responses on a rubric of accuracy, completeness, and tone.”)
- Human eval — human annotators rate model outputs, typically for high-stakes or subjective criteria
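To make the unit-test analogy concrete, an offline eval might look roughly like the sketch below. Everything here is assumed for illustration: the dataset, the stub_model function standing in for a real model call, the exact_match scorer, and the 0.8 pass threshold.

```python
def exact_match(output: str, expected: str) -> float:
    """Simplest possible scorer: 1.0 on a case-insensitive match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call, with canned answers."""
    canned = {"Capital of France?": "Paris", "2 + 2?": "4"}
    return canned.get(prompt, "I don't know")


def run_offline_eval(dataset, generate, score, threshold=0.8) -> dict:
    """Score every example in a fixed dataset and compare the mean to a threshold."""
    scores = [score(generate(ex["input"]), ex["expected"]) for ex in dataset]
    mean = sum(scores) / len(scores)
    return {"n": len(scores), "mean_score": mean, "passed": mean >= threshold}


DATASET = [
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "2 + 2?", "expected": "4"},
    {"input": "Largest planet?", "expected": "Jupiter"},
]

report = run_offline_eval(DATASET, stub_model, exact_match, threshold=0.8)
```

In an LLM-as-judge setup, the score function would call a second model with a rubric instead of comparing strings; the harness around it stays the same.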
Key Eval Metrics
- Faithfulness — does the output accurately reflect the source material? (Critical for RAG systems)
- Groundedness — is the output supported by the retrieved context?
- Relevance — does the output address the user’s actual question?
- Hallucination rate — the percentage of outputs containing factually incorrect or unsupported claims
- Toxicity — the presence of harmful, offensive, or policy-violating content
“Our offline evals show a faithfulness score of 0.87 on the benchmark dataset, but our online eval is catching a higher hallucination rate on queries about recent events — the model’s knowledge cutoff is showing.”
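A hallucination rate is simply a proportion, and slicing it by topic is one way to surface the kind of gap described in the quote above. The sketch below assumes each record already carries a hallucinated label from a judge model or a human annotator; the records and topic names are invented for illustration.

```python
from collections import defaultdict


def hallucination_rate(records: list) -> float:
    """Fraction of outputs labelled as containing unsupported claims."""
    if not records:
        return 0.0
    return sum(1 for r in records if r["hallucinated"]) / len(records)


def rate_by_topic(records: list) -> dict:
    """Break the overall rate down by a topic label attached to each record."""
    groups = defaultdict(list)
    for r in records:
        groups[r["topic"]].append(r)
    return {topic: hallucination_rate(rs) for topic, rs in groups.items()}


# Made-up labelled samples from an online eval.
records = [
    {"topic": "recent_events", "hallucinated": True},
    {"topic": "recent_events", "hallucinated": False},
    {"topic": "policy", "hallucinated": False},
    {"topic": "policy", "hallucinated": False},
]

overall = hallucination_rate(records)
per_topic = rate_by_topic(records)
```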
Latency Profiling Vocabulary
LLM latency has a unique shape compared to conventional services. Key metrics:
- TTFT (Time to First Token) — latency from request to the first token appearing in the stream. Critical for user-facing applications where streaming is used.
- TPS (Tokens Per Second) — the generation rate; higher is faster.
- E2E latency — end-to-end latency from user request to final response, including retrieval, tool calls, and generation.
- p50 / p95 / p99 — percentile latency measurements. “Our p99 TTFT is 4.2 seconds, which is above our SLO of 3 seconds.”
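These metrics can be derived from timestamps recorded while streaming. The following is a minimal sketch: the function names are hypothetical, the sample values are invented, and the percentile uses the nearest-rank method, which is only one of several common definitions (monitoring backends often interpolate instead).

```python
import math


def ttft(request_time: float, token_times: list) -> float:
    """Seconds from sending the request to receiving the first token."""
    return token_times[0] - request_time


def tokens_per_second(token_times: list) -> float:
    """Generation rate measured over the streaming window (first to last token)."""
    if len(token_times) < 2:
        return 0.0
    elapsed = token_times[-1] - token_times[0]
    return (len(token_times) - 1) / elapsed if elapsed > 0 else 0.0


def percentile(values: list, p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up TTFT samples in seconds, checked against a hypothetical 3-second SLO.
ttft_samples = [0.8, 1.1, 0.9, 4.2, 1.0]
p50_ttft = percentile(ttft_samples, 50)
p99_ttft = percentile(ttft_samples, 99)
slo_breached = p99_ttft > 3.0
```

With so few samples the p99 is just the maximum, which is why tail percentiles are normally computed over large windows of traffic.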
Latency Sources
In an agentic or RAG system, latency accumulates across multiple steps:
- Retrieval latency — time to fetch documents from a vector store or database
- Embedding latency — time to convert text into vector embeddings
- Generation latency — time the model takes to produce the response
- Tool call latency — time for an external tool (API, database) to respond
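Given spans tagged by step type, a per-step breakdown shows where the time goes. The sketch below assumes a simple dictionary shape for spans (kind, start_ms, end_ms) and invented timings; it is not the export format of any particular tracing tool.

```python
from collections import defaultdict


def latency_breakdown(spans: list) -> dict:
    """Total milliseconds spent in each kind of step."""
    totals = defaultdict(float)
    for span in spans:
        totals[span["kind"]] += span["end_ms"] - span["start_ms"]
    return dict(totals)


def slowest_step(spans: list) -> str:
    """The step kind that contributes the most total latency."""
    totals = latency_breakdown(spans)
    return max(totals, key=totals.get)


# Made-up spans for one sequential RAG request.
spans = [
    {"kind": "embedding", "start_ms": 0, "end_ms": 40},
    {"kind": "retrieval", "start_ms": 40, "end_ms": 160},
    {"kind": "tool_call", "start_ms": 160, "end_ms": 460},
    {"kind": "generation", "start_ms": 460, "end_ms": 1360},
]

breakdown = latency_breakdown(spans)
bottleneck = slowest_step(spans)
e2e_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
```

The per-step totals add up to the end-to-end latency only when steps run one after another; with parallel tool calls the sum will exceed the wall-clock time.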
Prompt Management Vocabulary
- Prompt version — a specific, saved iteration of a prompt. “We’re on prompt version 12 for the support bot. v11 had a higher hallucination rate on refund queries.”
- Prompt regression — when a new prompt version performs worse than the previous one on evals
- System prompt — the instruction-level prompt given to the model before user input
- Context window — the maximum number of tokens the model can process in one call; in most models this limit covers the prompt and the generated output together
- Context stuffing — filling the context window with as much relevant information as possible (sometimes a sign of poor retrieval design)
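A prompt regression check can be as simple as comparing eval metrics between two prompt versions. In the sketch below the metric names, the scores, the 0.01 tolerance, and the LOWER_IS_BETTER set are all assumptions made for illustration.

```python
# Metrics where a higher number means worse quality (assumed set).
LOWER_IS_BETTER = {"hallucination_rate", "toxicity"}


def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.01) -> list:
    """Return the metrics on which the candidate is worse than the baseline by more than the tolerance."""
    regressions = []
    for metric, old in baseline.items():
        new = candidate.get(metric)
        if new is None:
            continue  # metric not measured for the candidate
        # Positive "worsening" means the candidate moved in the bad direction.
        worsening = new - old if metric in LOWER_IS_BETTER else old - new
        if worsening > tolerance:
            regressions.append(metric)
    return regressions


# Made-up eval results for two prompt versions.
baseline_scores = {"faithfulness": 0.87, "relevance": 0.91, "hallucination_rate": 0.06}
candidate_scores = {"faithfulness": 0.88, "relevance": 0.84, "hallucination_rate": 0.05}

regressed = find_regressions(baseline_scores, candidate_scores)
```

Running a check like this in CI before promoting a new prompt version is one way to catch regressions before they reach production.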
Sampling and Temperature Vocabulary
- Temperature — a parameter controlling output randomness. Higher temperature = more varied, creative responses. Lower = more deterministic.
- Top-p (nucleus sampling) — limits token selection to the smallest set of tokens whose cumulative probability exceeds p
- Top-k — limits token selection to the k most probable tokens
- Sampling parameters — the collective term for temperature, top-p, top-k, and related settings
“We lowered the temperature from 0.8 to 0.2 for the classification task. The higher value was introducing too much variability.”
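The effect of these parameters can be seen in a toy sampler over three tokens. This is a simplified sketch in pure Python, not how any particular inference engine implements sampling; the logits are invented, and the top-p filter stops once the cumulative probability reaches or exceeds p.

```python
import math
import random


def softmax_with_temperature(logits: dict, temperature: float) -> dict:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def renormalise(probs: dict) -> dict:
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}


def top_k_filter(probs: dict, k: int) -> dict:
    """Keep only the k most probable tokens."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return renormalise(dict(ranked[:k]))


def top_p_filter(probs: dict, p: float) -> dict:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalise(kept)


def sample(probs: dict, rng: random.Random) -> str:
    """Draw one token according to its probability."""
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


LOGITS = {"yes": 2.0, "no": 1.0, "maybe": 0.0}  # made-up logits
warm = softmax_with_temperature(LOGITS, 0.8)
cold = softmax_with_temperature(LOGITS, 0.2)
token = sample(top_p_filter(cold, 0.9), random.Random(0))
```

At temperature 0.2 almost all of the probability mass sits on the top token, which is why lower temperatures suit tasks such as classification where consistency matters more than variety.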
Key Vocabulary Summary
- Trace — the complete record of a request through a system
- Span — one unit of work within a trace
- TTFT — time-to-first-token, the latency to the first streamed token
- Token budget — the allocated token limit for an operation
- Eval — an evaluation run that scores model output quality
- Faithfulness — whether output accurately reflects source material
- Hallucination — a model generating incorrect or unsupported information confidently
- LLM-as-judge — using another model to evaluate output quality
- Prompt regression — a new prompt version that performs worse than the old one
- Context window — the maximum number of tokens a model can process in one call
- Temperature — a sampling parameter controlling output randomness
AI observability is a fast-moving field, but the conceptual vocabulary above is stable and increasingly standard across tooling like LangSmith, Arize Phoenix, Traceloop, and OpenTelemetry’s semantic conventions for generative AI. Knowing these terms lets you participate in technical design reviews, read monitoring dashboards, and write runbooks for AI systems with precision.