Practice vocabulary for evaluating agentic AI systems: trajectory evaluation, task completion rate, tool call accuracy, and benchmarks.
1 / 5
A researcher says 'We use agent trajectory evaluation.' What does 'trajectory' refer to in this context?
An agent trajectory is the complete sequence of steps an agent takes — including observations, reasoning, and actions — from receiving a task to producing a final result. Evaluating the trajectory reveals whether the agent reasoned correctly, not just whether it got the right answer.
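As a rough illustration of the definition above, a trajectory can be modeled as an ordered list of steps, each holding an observation, the agent's reasoning, and the action taken. The names `Step`, `Trajectory`, and `actions_match` are hypothetical and chosen for this sketch; they do not come from any particular evaluation library. Exact matching of action sequences is only one simple way to compare a trajectory against a reference.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class Step:
    """One step of a trajectory (hypothetical structure for illustration)."""
    observation: str  # what the agent saw before acting
    reasoning: str    # why it chose the action
    action: str       # what it did


@dataclass
class Trajectory:
    """Everything between receiving the task and producing the final result."""
    task: str
    steps: List[Step] = field(default_factory=list)
    final_result: Optional[str] = None


def actions_match(trajectory: Trajectory, expected_actions: Sequence[str]) -> bool:
    """Return True if the agent took exactly the expected actions, in order."""
    return [step.action for step in trajectory.steps] == list(expected_actions)


# A correct final answer does not guarantee a correct path to it.
example = Trajectory(
    task="Find the capital of France",
    steps=[
        Step("task received", "I should look this up", "search"),
        Step("search results", "The answer is in the first result", "answer"),
    ],
    final_result="Paris",
)
```

Here `actions_match(example, ["search", "answer"])` is true, while comparing against `["answer"]` alone is false even though the final result would be the same.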
2 / 5
Your team reports 'The agent completed the task in 8 steps vs. the expected 5.' Why does step count matter in agentic evaluation?
In agentic systems, unnecessary steps add latency and cost, and each step is a point where errors can compound. Comparing actual vs. expected step counts helps evaluate agent efficiency and reasoning quality.
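The comparison of actual and expected step counts can be sketched as a simple ratio. The function names and the choice to cap the score at 1.0 are assumptions made for this example, not a standard definition.

```python
def extra_steps(actual: int, expected: int) -> int:
    """Number of steps taken beyond the expected count (never negative)."""
    return max(0, actual - expected)


def step_efficiency(actual: int, expected: int) -> float:
    """Ratio of expected to actual steps, capped at 1.0.

    A score of 1.0 means the agent used no more steps than expected;
    lower scores mean more unnecessary steps.
    """
    if actual <= 0 or expected <= 0:
        raise ValueError("step counts must be positive")
    return min(1.0, expected / actual)


# The case from the question: 8 steps taken where 5 were expected.
surplus = extra_steps(8, 5)        # 3
efficiency = step_efficiency(8, 5)  # 0.625
```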
3 / 5
An evaluation report shows 'tool call accuracy: 78%.' What does this metric measure?
Tool call accuracy measures whether the agent selected the right tool AND supplied correct arguments. An agent might call the right tool with wrong parameters — a failure mode this metric captures.
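One way to compute this metric, assuming a reference list of expected calls, is to count a call as correct only when both the tool name and the arguments match. This sketch makes several assumptions: calls are compared position by position, arguments must match exactly, and missing calls count as errors. Real evaluation harnesses may align or score calls differently.

```python
from typing import Any, Dict, List, Tuple

# A tool call is represented here as (tool_name, arguments); this shape is
# an assumption for the example, not a standard format.
ToolCall = Tuple[str, Dict[str, Any]]


def tool_call_accuracy(predicted: List[ToolCall], expected: List[ToolCall]) -> float:
    """Fraction of expected calls where the agent chose the right tool
    AND supplied the right arguments (exact match, compared by position)."""
    if not expected:
        raise ValueError("expected must contain at least one tool call")
    correct = sum(
        1
        for (p_name, p_args), (e_name, e_args) in zip(predicted, expected)
        if p_name == e_name and p_args == e_args
    )
    # Dividing by len(expected) means missing calls count as failures.
    return correct / len(expected)


expected_calls = [
    ("search", {"query": "weather in Paris"}),
    ("convert_units", {"value": 20, "to": "fahrenheit"}),
]
predicted_calls = [
    ("search", {"query": "weather in Paris"}),         # right tool, right arguments
    ("convert_units", {"value": 20, "to": "kelvin"}),  # right tool, wrong arguments
]
accuracy = tool_call_accuracy(predicted_calls, expected_calls)
```

In this example the second call picks the right tool but passes a wrong argument, so only one of the two calls counts as correct.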
4 / 5
A colleague mentions 'the agent's reasoning trace shows the agent misidentified the goal.' What is a reasoning trace?
A reasoning trace (also called a thought trace or chain-of-thought log) shows how the agent broke down a task, what it decided at each step, and why. It is essential for diagnosing agent failures.
5 / 5
Your team says 'We benchmark the agent on SWE-bench.' What type of benchmark is this?
SWE-bench is an agentic coding benchmark: the agent is given a real GitHub issue from an open-source Python repository and must produce a code patch that resolves it, with success checked by running the repository's tests. It is widely used to evaluate coding agents on realistic tasks.