Vocabulary for AI Observability and LLM Tracing

As LLM-powered applications move into production, the discipline of observability is adapting to meet them. Traditional metrics — CPU, memory, request latency — are necessary but insufficient for understanding how an AI system behaves. A new vocabulary has emerged around LLM tracing, evaluation, and monitoring. This guide covers that vocabulary at the depth needed for engineers building and operating production AI systems.

Why LLM Observability Is Different

In a conventional web service, a request goes in and a deterministic response comes out. If something is wrong, you look at the error log, the database query, or the slow function call. The system is auditable.

In an LLM-powered system, the “computation” is sampling from a probability distribution over tokens, so the same input can produce different outputs. The system can also be confidently wrong, a failure mode known as hallucination. There is no stack trace for a bad answer. Observability for LLMs therefore requires capturing the full context of a model call: the prompt, the model parameters, the output, the latency, and the downstream effect of that output.

Core Tracing Vocabulary

Trace

A trace is the complete record of a single request as it flows through a system. In distributed tracing (OpenTelemetry, Jaeger), a trace is made of spans. For LLM systems, a trace typically captures the entire chain from user input to final response, including any intermediate tool calls, retrieval steps, or model calls.

“The trace shows that the latency spike came from the embedding step, not the generation call.”

Span

A span is a single unit of work within a trace — one function call, one HTTP request, or one LLM call. Spans have a start time, end time, and a set of attributes (key-value metadata).

“Each LLM call gets its own span. I can see the token counts, the model ID, and the latency for every call in the chain.”
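
To make the relationship between a trace and its spans concrete, here is a minimal sketch in plain Python. It is not the OpenTelemetry API: the Span and Trace classes, the attribute names, and the model names are all hypothetical and exist only for illustration.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One unit of work: a name, start/end times, and key-value attributes."""
    name: str
    start: float
    end: Optional[float] = None
    attributes: dict = field(default_factory=dict)

    @property
    def duration(self) -> float:
        # A span that has not ended yet reports a duration of zero.
        return 0.0 if self.end is None else self.end - self.start


@dataclass
class Trace:
    """The complete record of one request: an ordered list of spans."""
    trace_id: str
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, name, **attributes):
        s = Span(name=name, start=time.perf_counter(), attributes=dict(attributes))
        try:
            yield s
        finally:
            # Close the span and record it even if the wrapped work raised.
            s.end = time.perf_counter()
            self.spans.append(s)

    def slowest(self) -> Span:
        return max(self.spans, key=lambda s: s.duration)


trace = Trace(trace_id="req-001")  # hypothetical request ID
with trace.span("embedding", model="hypothetical-embedder"):
    time.sleep(0.05)  # stand-in for real work
with trace.span("generation", model="hypothetical-llm") as s:
    time.sleep(0.001)
    s.attributes["output_tokens"] = 42  # attributes can be added mid-span

print(trace.slowest().name)
```

In a real system the same idea is usually delegated to a tracing library, which also handles parent-child nesting and export.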

LLM Span Attributes

For LLM calls specifically, spans are often enriched with:

model — the model name and version (e.g., claude-sonnet-4-5)
input tokens — the number of tokens in the prompt
output tokens — the number of tokens in the completion
total tokens — combined count
latency — time from request to first token (TTFT, time-to-first-token) and time to complete response
finish reason — why the model stopped generating; the exact values vary by provider (e.g., stop, max_tokens, tool_use)
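
As a rough sketch, the attributes listed above can be gathered into one flat dictionary per LLM call. The function name, the key names, and the timestamps are assumptions made for this example; they are not taken from any particular tracing standard or SDK.

```python
def llm_span_attributes(model, input_tokens, output_tokens,
                        requested_at, first_token_at, completed_at,
                        finish_reason):
    """Collect the LLM span attributes into one flat dict (hypothetical keys)."""
    return {
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_tokens": input_tokens + output_tokens,
        "ttft_s": first_token_at - requested_at,    # time to first token
        "latency_s": completed_at - requested_at,   # time to complete response
        "finish_reason": finish_reason,
    }


attrs = llm_span_attributes(
    model="claude-sonnet-4-5",
    input_tokens=1200, output_tokens=300,
    requested_at=10.0, first_token_at=10.5, completed_at=14.0,  # made-up times
    finish_reason="max_tokens",
)

# A finish reason of max_tokens usually means the answer was cut off.
truncated = attrs["finish_reason"] == "max_tokens"
print(attrs["total_tokens"], attrs["ttft_s"], truncated)
```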

Token Budget

A token budget is the maximum number of tokens allocated to a specific operation. Managing token budgets is critical for controlling cost and latency.

“We’re exceeding our token budget on the summarisation step. The retrieved context is too verbose — we need to compress it before passing it to the model.”

Key collocations:

exhaust the token budget — use all allocated tokens
trim the context — reduce the size of the prompt to fit within the budget
truncate — cut off content that exceeds a limit
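
The collocations above map onto a small routine: keep retrieved chunks while they fit, then truncate the first chunk that does not. This is a sketch under a loud assumption: tokens are approximated by whitespace-separated words, whereas a real system would count with the model's own tokenizer. The function names are hypothetical.

```python
def count_tokens(text):
    # Assumption: whitespace words stand in for tokens. Real tokenizers
    # produce different counts.
    return len(text.split())


def trim_context(chunks, budget):
    """Keep chunks in order until the budget is exhausted.

    The first chunk that does not fit is truncated to the remaining
    budget, and everything after it is dropped.
    """
    kept, used = [], 0
    for chunk in chunks:
        n = count_tokens(chunk)
        if used + n <= budget:
            kept.append(chunk)
            used += n
            continue
        remaining = budget - used
        if remaining > 0:
            kept.append(" ".join(chunk.split()[:remaining]))
            used = budget
        break
    return kept, used


chunks = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
kept, used = trim_context(chunks, budget=6)
print(kept, used)
```

Dropping from the end assumes the chunks are already ordered by relevance; if they are not, ranking them first is usually the better fix.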

Evaluation Vocabulary

Evals (short for evaluations) are the LLM equivalent of unit tests. They assess the quality of model outputs, either automatically or with human review.

Types of Evals

Offline eval — run against a fixed dataset before deployment; analogous to running tests in CI
Online eval — run against live production traffic; samples real inputs and scores the outputs the system actually returned
LLM-as-judge — using another LLM to score the output of the primary model (“We use GPT-4o as a judge to rate our Claude responses on a rubric of accuracy, completeness, and tone.”)
Human eval — human annotators rate model outputs, typically for high-stakes or subjective criteria
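
An offline eval can be sketched as a loop over a fixed dataset with a pass threshold, much like a test suite in CI. Everything here is illustrative: the dataset, the stub model, the exact-match scorer, and the 0.9 threshold are assumptions, and real evals typically use richer scorers.

```python
DATASET = [  # fixed, versioned examples (made up for this sketch)
    {"input": "capital of France?", "expected": "Paris"},
    {"input": "2 + 2?", "expected": "4"},
    {"input": "largest planet?", "expected": "Jupiter"},
]


def stub_model(prompt):
    # Stand-in for a real model call; one answer is deliberately wrong.
    canned = {
        "capital of France?": "paris",
        "2 + 2?": "4",
        "largest planet?": "Saturn",
    }
    return canned[prompt]


def exact_match(output, expected):
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_offline_eval(model, dataset, scorer):
    """Score every example and return the mean score."""
    scores = [scorer(model(row["input"]), row["expected"]) for row in dataset]
    return sum(scores) / len(scores)


THRESHOLD = 0.9  # hypothetical quality gate
score = run_offline_eval(stub_model, DATASET, exact_match)
passed = score >= THRESHOLD
print(f"score={score:.2f} passed={passed}")
```

Swapping exact_match for a function that calls a judge model would turn the same harness into an LLM-as-judge eval.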

Key Eval Metrics

Faithfulness — does the output accurately reflect the source material? (Critical for RAG systems)
Groundedness — is the output supported by the retrieved context?
Relevance — does the output address the user’s actual question?
Hallucination rate — the percentage of outputs containing factually incorrect or unsupported claims
Toxicity — the presence of harmful, offensive, or policy-violating content

“Our offline evals show a faithfulness score of 0.87 on the benchmark dataset, but our online eval is catching a higher hallucination rate on queries about recent events — the model’s knowledge cutoff is showing.”
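
To show how a groundedness score and a hallucination rate relate, here is a deliberately crude sketch. It treats groundedness as the fraction of output words that also appear in the retrieved context, which is only a toy proxy; production systems generally rely on an LLM judge or an entailment model instead. The function names and the sample texts are assumptions.

```python
import re


def words(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def groundedness(output, context):
    """Fraction of output words that appear in the context (toy proxy)."""
    out = words(output)
    if not out:
        return 0.0
    ctx = set(words(context))
    return sum(w in ctx for w in out) / len(out)


def hallucination_rate(scores, threshold=1.0):
    """Percentage of outputs whose groundedness falls below the threshold."""
    flagged = sum(score < threshold for score in scores)
    return 100.0 * flagged / len(scores)


context = "The refund window is 30 days from delivery."
outputs = ["Refund window is 30 days", "Refund window is 90 days"]

scores = [groundedness(o, context) for o in outputs]
rate = hallucination_rate(scores)
print(scores, rate)
```

The second output is flagged only because "90" never appears in the context, which shows both the appeal and the limits of word overlap as a signal.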

Latency Profiling Vocabulary

LLM latency has a unique shape compared to conventional services. Key metrics:

TTFT (Time to First Token) — latency from request to the first token appearing in the stream. Critical for user-facing applications where streaming is used.
TPS (Tokens Per Second) — the generation rate; higher is faster.
E2E latency — end-to-end latency from user request to final response, including retrieval, tool calls, and generation.
p50 / p95 / p99 — percentile latency measurements. “Our p99 TTFT is 4.2 seconds, which is above our SLO of 3 seconds.”
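
The metrics above reduce to simple arithmetic over timestamps. The sketch below uses the nearest-rank percentile method and measures TPS from the first token to completion; both are choices made for this example, since tools differ on interpolation and on where the TPS clock starts. The sample values and the 3-second SLO check are illustrative.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def tokens_per_second(output_tokens, first_token_at, completed_at):
    # Generation rate measured over the streaming phase only.
    return output_tokens / (completed_at - first_token_at)


# Made-up TTFT samples in seconds, with one slow outlier.
ttft_samples = [0.8, 1.1, 0.9, 4.2, 1.0, 1.3, 0.7, 1.2, 0.95, 1.05]

p50 = percentile(ttft_samples, 50)
p95 = percentile(ttft_samples, 95)
p99 = percentile(ttft_samples, 99)

SLO_SECONDS = 3.0
slo_breached = p99 > SLO_SECONDS

tps = tokens_per_second(120, first_token_at=0.5, completed_at=3.5)
print(p50, p95, p99, slo_breached, tps)
```

With only ten samples the p95 and p99 both land on the single outlier, which is why tail percentiles need a reasonably large sample to be meaningful.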

Latency Sources

In an agentic or RAG system, latency accumulates across multiple steps:

Retrieval latency — time to fetch documents from a vector store or database
Embedding latency — time to convert text into vector embeddings
Generation latency — time the model takes to produce the response
Tool call latency — time for an external tool (API, database) to respond
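
A per-step breakdown makes it easy to see which source dominates end-to-end latency. This sketch assumes the steps run one after another, so their latencies simply add up; steps that run in parallel would need a different calculation. The numbers are invented for illustration.

```python
# Hypothetical per-step latencies for one request, in seconds.
steps = {
    "embedding": 0.12,
    "retrieval": 0.08,
    "generation": 1.90,
    "tool_call": 0.40,
}

e2e = sum(steps.values())                      # sequential steps add up
shares = {name: t / e2e for name, t in steps.items()}
dominant = max(steps, key=steps.get)

for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<12}{steps[name]:>6.2f}s {share:>6.1%}")
print(f"E2E latency: {e2e:.2f}s, dominated by {dominant}")
```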

Prompt Management Vocabulary

Prompt version — a specific, saved iteration of a prompt. “We’re on prompt version 12 for the support bot. v11 had a higher hallucination rate on refund queries.”
Prompt regression — when a new prompt version performs worse than the previous one on evals
System prompt — the instruction-level prompt given to the model before user input
Context window — the maximum number of tokens the model can process in one call; for most models this covers the prompt plus the generated output
Context stuffing — filling the context window with as much relevant information as possible (sometimes a sign of poor retrieval design)
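
Detecting a prompt regression amounts to comparing eval scores between two prompt versions, metric by metric. The sketch below is one possible shape for that check: the scores, the metric names, and the 0.02 tolerance are made up, and the direction table is needed because a lower hallucination rate is better while a higher faithfulness score is better.

```python
# Direction of improvement per metric (assumed metric names).
HIGHER_IS_BETTER = {
    "faithfulness": True,
    "relevance": True,
    "hallucination_rate": False,
}


def find_regressions(baseline, candidate, tolerance=0.02):
    """Return the metrics on which candidate is worse than baseline by more than tolerance."""
    regressed = []
    for metric, better_high in HIGHER_IS_BETTER.items():
        delta = candidate[metric] - baseline[metric]
        worse_by = -delta if better_high else delta
        if worse_by > tolerance:
            regressed.append(metric)
    return regressed


# Invented eval results for two prompt versions.
scores = {
    "v11": {"faithfulness": 0.80, "relevance": 0.90, "hallucination_rate": 0.10},
    "v12": {"faithfulness": 0.87, "relevance": 0.89, "hallucination_rate": 0.04},
}

upgrade = find_regressions(scores["v11"], scores["v12"])    # v11 -> v12
rollback = find_regressions(scores["v12"], scores["v11"])   # v12 -> v11
print(upgrade, rollback)
```

The tolerance keeps small run-to-run noise, such as the 0.01 drop in relevance here, from being reported as a regression.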

Sampling and Temperature Vocabulary

Temperature — a parameter controlling output randomness. Higher temperature = more varied, creative responses. Lower = more deterministic.
Top-p (nucleus sampling) — limits token selection to the smallest set of tokens whose cumulative probability exceeds p
Top-k — limits token selection to the k most probable tokens
Sampling parameters — the collective term for temperature, top-p, top-k, and related settings

“We lowered the temperature from 0.8 to 0.2 for the classification task. The higher value was introducing too much variability.”
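
The three sampling parameters can be illustrated on a toy distribution in plain Python. This is a sketch of the general technique rather than any provider's implementation; the function names and logits are assumptions, temperature must be greater than zero here, and the top-p cutoff uses "reaches or exceeds p", a boundary detail on which implementations may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _keep_only(probs, keep):
    """Zero out every index not in keep and renormalize the rest."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _keep_only(probs, order[:k])


def top_p(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # smallest set whose mass reaches p
            break
    return _keep_only(probs, keep)


logits = [2.0, 1.0, 0.0]
cool = softmax(logits, temperature=0.2)
warm = softmax(logits, temperature=0.8)
print(round(cool[0], 3), round(warm[0], 3))

probs = [0.5, 0.3, 0.15, 0.05]
print(top_k(probs, 2))
print(top_p(probs, 0.9))
```

At temperature 0.2 nearly all of the probability mass sits on the top token, which is why a low temperature suits a classification task.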

Key Vocabulary Summary

Trace — the complete record of a request through a system
Span — one unit of work within a trace
TTFT — time-to-first-token, the latency to the first streamed token
Token budget — the allocated token limit for an operation
Eval — an evaluation run that scores model output quality
Faithfulness — whether output accurately reflects source material
Hallucination — a model generating incorrect or unsupported information confidently
LLM-as-judge — using another model to evaluate output quality
Prompt regression — a new prompt version that performs worse than the old one
Context window — the maximum number of tokens a model can process in one call
Temperature — a sampling parameter controlling output randomness

AI observability is a fast-moving field, but the conceptual vocabulary above is stable and increasingly standard across tooling like LangSmith, Arize Phoenix, Traceloop, and OpenTelemetry’s generative AI (GenAI) semantic conventions. Knowing these terms lets you participate in technical design reviews, read monitoring dashboards, and write runbooks for AI systems with precision.
