Evaluation: How to Know If Your Agent System Works
Shipping an agent to production without a robust eval strategy is like deploying a web service with no monitoring. You will not know it is broken until a user tells you—and even then you will not know why or how often. This page gives you the mental models and practical tools to build evaluations that actually predict production behavior.
1. Why Evals Are Hard for Agents
Agent evaluation is qualitatively harder than evaluating a classifier or a single-turn chatbot. Three structural problems undermine most naive approaches.
1.1 Non-Determinism
LLM outputs are probabilistic. The same agent run on identical inputs can produce different tool call sequences, different intermediate reasoning, and different final outputs. This variation does not average out like ordinary sampling noise; it compounds across every step of a long trajectory.
An arXiv paper on non-determinism in LLM evaluation demonstrates that most evaluations are based on a single output per example, which fundamentally misrepresents performance variability. With best-of-N (oracle) selection, smaller models like Llama-3-8B can outperform GPT-4-Turbo on MMLU, GSM8K, and HumanEval—meaning a single pass@1 score can severely underestimate capability. Alignment methods like SimPO reduce variance in some tasks but can cause performance decline elsewhere, showing there is no free lunch.
The metric that matters in production is pass^k—the probability of succeeding on all k consecutive runs—not pass@1. TAU-bench data makes this stark:
- Best GPT-4o agent: < 50% pass@1 on retail + airline support tasks
- Same agent at pass^8 (succeed every time across 8 attempts): < 25%
That sub-25% pass^8 figure is closer to what customers experience than pass@1: a production agent faces the same kind of request many times, and every failed run reaches a real user. Reporting only pass@1 hides this gap completely.
pass@1 vs pass^k
If your eval reports a single pass@1 score, you are measuring the average outcome of one attempt, not consistency. pass^k measures reliability: the probability that the agent succeeds on every one of k attempts at the same task. For customer-facing agents, report pass^k for k ≥ 5.
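Both metrics can be estimated from the same set of repeated runs. The sketch below uses the standard combinatorial estimators, assuming `n` recorded runs of one task with `c` successes; the function names and the example counts are illustrative, not taken from any benchmark harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled runs succeeds) from c successes in n runs."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k sampled runs succeed) from c successes in n runs."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


# Hypothetical task: 6 successes out of 8 recorded runs.
n_runs, n_success = 8, 6
for k in (1, 2, 4, 8):
    print(k, round(pass_at_k(n_runs, n_success, k), 3), round(pass_hat_k(n_runs, n_success, k), 3))
```

As k grows, pass@k rises toward 1 while pass^k falls toward 0 for any task that is not solved on every run. Average each estimate across tasks to get a suite-level number.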
1.2 Long Trajectories and Error Compounding
Multi-step agent tasks suffer from compounding errors: a single wrong decision early in a trajectory invalidates all subsequent work. The General AgentBench paper (2026) documents a critical context ceiling effect—performance improvements from additional computation are bounded, and beyond a certain context depth, longer interactions become unstable and performance degrades.
DeepMind's DeepSearchQA benchmark (900 multi-step information-seeking tasks) identifies two opposing failure modes in research agents:
- Under-retrieval: The agent finds most items but misses the long tail of obscure entities, failing to complete the reasoning chain.
- Over-retrieval: The agent achieves full recall but fails to recognize when search is complete, hallucinating extra items or drifting into adjacent topics.
The Vending-Bench benchmark tests agents managing a simulated business over hundreds of turns. It revealed catastrophic coherence failures that do not correlate with context window usage (r = 0.167). Claude 3.5 Sonnet escalated a supplier dispute to "QUANTUM NUCLEAR LEGAL INTERVENTION"—a sudden meltdown with no gradual warning. Every tested model experienced at least one complete derailment. No model consistently avoided catastrophic failure.
Catastrophic failure ≠ gradual degradation
Standard quality metrics track gradual decline. Catastrophic failures—where the agent goes completely off the rails after many good steps—require separate detection: anomaly scores on trajectory coherence, maximum deviation from a reference path, or explicit bounds checking on agent actions.
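Explicit bounds checking is the simplest of these detectors to implement. The sketch below is one possible shape, assuming a trajectory is a list of steps with a tool name and an optional monetary amount; the `Step` type, the allowed-tool set, and the thresholds are all hypothetical and would come from your own policy.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str
    amount: float = 0.0  # monetary value touched by the step, if any


# Hypothetical policy: allowed tools, a per-step spend ceiling, and a loop limit.
ALLOWED_TOOLS = {"search_supplier", "place_order", "send_email"}
MAX_AMOUNT = 500.0
MAX_REPEATS = 3  # identical consecutive tool calls tolerated before flagging a loop


def check_bounds(trajectory: list[Step]) -> list[str]:
    """Return human-readable violations; an empty list means the run stayed in bounds."""
    violations = []
    repeats = 0
    for i, step in enumerate(trajectory):
        if step.tool not in ALLOWED_TOOLS:
            violations.append(f"step {i}: tool '{step.tool}' is not allowed")
        if step.amount > MAX_AMOUNT:
            violations.append(f"step {i}: amount {step.amount} exceeds {MAX_AMOUNT}")
        # Count how many times in a row the same tool has been called.
        repeats = repeats + 1 if i > 0 and step.tool == trajectory[i - 1].tool else 1
        if repeats > MAX_REPEATS:
            violations.append(f"step {i}: '{step.tool}' repeated {repeats} times in a row")
    return violations
```

A check like this runs on every step at negligible cost, so it can halt a run at the first violation instead of scoring the damage afterwards.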
1.3 The Benchmark-Production Gap
This is the most consequential problem. A Chanl AI case study documents the pattern concisely: an agent scoring 92% on GAIA achieved only 64% customer satisfaction (CSAT) in production. The benchmark tested what the agent could do in controlled conditions; production tested how—handling ambiguous phrasing, mid-conversation corrections, and implicit user intent.
Pearl's private evaluation dataset (real-world professional services questions, not public) shows the same pattern:
- Public benchmarks (MMLU, GPQA, SWE-bench Verified): 80–90% for frontier models
- Pearl's private real-world evaluation: ~60% on complex professional workflows
- ~30 percentage-point gap, even for the best models
Three structural reasons explain the gap, per the Chanl AI analysis:
- Single-run vs. multi-run: Benchmarks report pass@1. Production requires consistent success across thousands of interactions—agent performance drops 39% in multi-turn tasks with a 112% increase in unreliability.
- Task isolation vs. conversation coherence: Benchmark tasks are isolated; production tasks carry conversational history, corrections, and implicit context.
- Benchmark contamination: Training data exposure silently inflates scores (see Section 4.6).
A Cleanlab survey of 95 production AI agent teams found that only 28% of teams are satisfied with their agent guardrails—confirming the production gap is a widespread, unresolved problem.
Simmering.dev's enterprise agent review classifies agent readiness based on 306 AI practitioner reports:
| Stage | Status | Rationale |
|---|---|---|
| Internal tools (research, coding, analysis) | Ready | Profitable accuracy/productivity trade-off; human review time < savings |
| Customer-facing agents | Needs monitoring | Consistency issues; pass^k too low |
| Long-running autonomous agents | Not ready | Catastrophic failures observed; no reliable error recovery |
2. What to Evaluate
A single "quality score" is not sufficient. You need five distinct measurement dimensions.
2.1 Output Quality
The final answer must be evaluated on multiple orthogonal criteria—never bundled into a single score:
- Factual accuracy: Does the output contain false claims?
- Faithfulness/groundedness: Are claims supported by retrieved sources, or hallucinated?
- Completeness: Does the output fully address every part of the query?
- Relevance: Is the output on-topic for the user's actual intent, or does it address an adjacent but wrong question?
Monte Carlo Data's agent framework uses a composite pass/fail test combining helpfulness, accuracy, faithfulness, and clarity—but never declares failure on a single metric alone. Evaluating each dimension independently lets you diagnose which kind of quality failure occurred, not just that something went wrong.
2.2 Trajectory Correctness
An agent can produce a correct final answer via an incorrect or unsafe path, and a correct path can still produce a wrong answer if a single tool call returns bad data. Outcome scoring is not the same as step-level correctness.
Abaka AI's research documents this: an agent can achieve 100% tool-call accuracy while violating policy on edge cases, and a research agent can call every required API while delivering a summary a domain expert would reject.
Arize AI's agent evaluation taxonomy defines five trajectory dimensions:
| Evaluation Type | What It Checks |
|---|---|
| Agent Tool Selection | Did the agent select appropriate tool(s) for the input? Were better tools overlooked? |
| Agent Parameter Extraction | Were all necessary parameters correctly extracted and formatted? |
| Agent Tool Calling | Was the correct tool invoked with accurate, complete parameters? |
| Agent Path Evaluation | Does the action sequence follow a logical, efficient path? Are there loops or dead-ends? |
| Agent Reflection | Can the agent identify and correct errors in its own reasoning? |
2.3 Tool Use Accuracy
Tool use accuracy measures whether the agent invoked the right tool, with the right parameters, at the right time. ToolBench formalizes three distinct sub-metrics:
| Metric | Definition |
|---|---|
| Pass Rate (PR) | Fraction of instructions successfully completed end-to-end |
| AST/DAG Accuracy | Structural match of actual vs. expected call patterns for multi-step sequences |
| Hallucination Rate | Fraction of actions calling non-existent or irrelevant APIs |
Strong models can exceed 88% AST accuracy in zero-shot scenarios given oracle retrievers, but performance drops significantly under realistic retrieval conditions. The hallucination rate—calling APIs that do not exist—is especially important to track when agents can discover or compose tools dynamically.
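The non-existent-API half of the hallucination rate can be computed without any judge. A minimal sketch, assuming you have the list of tool names the agent called and a registry of tools that actually exist (both names are illustrative); detecting calls that exist but are irrelevant needs a reference trajectory or a judge and is not covered here.

```python
def hallucination_rate(called: list[str], registry: set[str]) -> float:
    """Fraction of tool calls whose name is absent from the tool registry."""
    if not called:
        return 0.0
    return sum(name not in registry for name in called) / len(called)


# Hypothetical run: two of the four calls target tools that do not exist.
registry = {"get_weather", "book_flight"}
calls = ["get_weather", "get_wether", "book_flight", "cancel_everything"]
rate = hallucination_rate(calls, registry)
```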
2.4 Cost Efficiency
An agent that correctly completes a task using six tool calls when two would suffice is less production-ready despite identical accuracy. BFCL v4 explicitly tracks cost and latency alongside accuracy. From Blaxel's analysis of HumanEval agent variants: accuracy-optimized agents can be 4.4 to 10.8× more expensive than cost-aware alternatives with comparable performance—a gap invisible if cost is not tracked per eval run.
Track these metrics per eval case:
- Total cost per task (USD, across the full trajectory)
- Tool calls per successful completion
- Cost-normalized accuracy: accuracy ÷ cost per task
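The three metrics above reduce to a few lines of arithmetic once each eval case records its outcome, cost, and tool-call count. This is a sketch under that assumption; the `EvalCase` fields and the report keys are hypothetical names, not part of any framework.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    passed: bool
    cost_usd: float  # summed over every LLM and tool call in the trajectory
    tool_calls: int


def cost_report(cases: list[EvalCase]) -> dict[str, float]:
    """Aggregate accuracy and cost metrics over a list of eval cases."""
    if not cases:
        raise ValueError("no eval cases")
    successes = [c for c in cases if c.passed]
    accuracy = len(successes) / len(cases)
    mean_cost = sum(c.cost_usd for c in cases) / len(cases)
    return {
        "accuracy": accuracy,
        "mean_cost_usd": mean_cost,
        # Only successful runs count here, so wasted calls on failures are excluded.
        "tool_calls_per_success": (
            sum(c.tool_calls for c in successes) / len(successes) if successes else float("nan")
        ),
        # Cost-normalized accuracy: accuracy divided by mean cost per task.
        "accuracy_per_usd": accuracy / mean_cost if mean_cost > 0 else float("inf"),
    }
```

Comparing `accuracy_per_usd` across agent variants makes the trade-off visible: two variants with the same accuracy can differ several-fold on this number.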
W&B Weave automatically calculates cost per LLM call by tracking token usage and applying model pricing—making cost measurement a first-class part of the eval infrastructure.
2.5 Latency Per Task
Latency is both a user experience dimension and a cost signal (agent wall-clock time correlates with token spend). For production agents, measure:
- End-to-end task latency (wall clock time from query to final answer)
- Latency per step (to identify bottlenecks in multi-step pipelines)
- Time-to-first-token for streaming UIs
- Latency under load (does it degrade with concurrent requests?)
Promptfoo's deterministic assertions include a latency assertion that fails the eval if latency exceeds a threshold in milliseconds—enabling latency enforcement as part of an automated test suite without any LLM cost.
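If you are not using a framework, per-step latency and tail percentiles are easy to collect by hand. The sketch below assumes each agent step can be wrapped in a context manager; `timed`, `percentile`, and the step name are illustrative, and the nearest-rank percentile is one of several common definitions.

```python
import math
import time
from contextlib import contextmanager


@contextmanager
def timed(step_name: str, sink: dict[str, float]):
    """Record wall-clock seconds for one step into `sink`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[step_name] = time.perf_counter() - start


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


timings: dict[str, float] = {}
with timed("retrieve", timings):
    time.sleep(0.01)  # stand-in for a real agent step

# End-to-end latencies (ms) from five hypothetical eval runs.
latencies_ms = [120.0, 80.0, 300.0, 95.0, 110.0]
p50, p95 = percentile(latencies_ms, 50), percentile(latencies_ms, 95)
```

Reporting p95 alongside the median matters for agents because a single slow tool call or retry loop dominates the tail.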
3. Evaluation Approaches
Use all four approaches in combination. Each catches failures the others miss.
3.1 LLM-as-Judge
An LLM-as-judge uses a separate, powerful model (typically GPT-4 class) to evaluate the outputs of the agent under test. The judge receives an evaluation prompt describing criteria and returns a score, label, or ranking.
Three modes, from the arXiv survey on Agent-as-a-Judge:
- Pointwise: Score one output on given criteria (e.g., coherence 1–5).
- Pairwise: Compare two outputs, decide which is better. Used in win-rate metrics.
- Checklist/rubric: Break assessment into specific criteria checked independently (LLM-RUBRIC, CheckEval).
Early MT-Bench and AlpacaEval work showed that a well-prompted GPT-4 judge achieves Spearman correlations of 0.8–0.9 with aggregate human preferences—making it a practical, scalable alternative to human evaluation. ChatEval's multi-agent debate framework uses multiple LLMs with diverse roles and demonstrated 10–16% improvement in correlation with human judgments over single-agent prompting.
Concrete Judge Prompt Patterns
Faithfulness / Hallucination Detection (RAG):
```text
Evaluate the following RESPONSE for faithfulness to the CONTEXT.
A faithful response should only include information present in the context,
avoid inventing new details, and not contradict the context.
Return one of the following labels: 'Faithful' or 'Not Faithful'.
```
Trajectory Quality (using LangChain agentevals):
```python
from agentevals.trajectory.llm import create_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT

# Build an LLM judge that scores a whole trajectory with the stock accuracy prompt.
evaluator = create_trajectory_llm_as_judge(
    model="openai:o3-mini",
    prompt=TRAJECTORY_ACCURACY_PROMPT,
)

# `result` is the output of the agent run under test; it is not defined in this snippet.
evaluation = evaluator(
    outputs=result["messages"],  # full agent trajectory
)
# Returns: {'key': 'trajectory_accuracy', 'score': True,
#           'comment': 'The provided agent trajectory is reasonable...'}
```
Prompt Adherence (from Monte Carlo Data):
```text
You are an expert evaluator assessing prompt adherence in LLM outputs.

## Evaluation Criteria:
1. Extract all specific instructions from the input
2. Identify format requirements (JSON, list, length, etc.)
3. Check style requirements (tone, perspective, formality)
4. Verify constraint compliance (word limits, exclusions, etc.)
5. Assess structural requirements (sections, order, etc.)
6. Validate all instructions are followed

## Input: {{prompts}}
## Output: {{completions}}

Assign a score from 1 to 5 where:
- 5 = All instructions perfectly followed
- 4 = Most instructions followed with minor deviations
- 3 = Some instructions followed, some ignored
- 2 = Few instructions followed
- 1 = Instructions largely ignored
```
LLM Judge Bias: Magnitudes and Mitigations
Every LLM judge introduces systematic biases. Know them before trusting scores at scale. Sources: eval.qa bias guide, arXiv position bias study.
| Bias Type | Magnitude | What Happens | Mitigation |
|---|---|---|---|
| Position bias | 5–15 percentage points | Judges favor responses appearing first (or last) in pairwise comparisons. GPT-4-0613 shows ~81.5% positional consistency—meaning ~18.5% of decisions flip based on order alone. | Randomize position across comparison pairs; test both orderings and average. |
| Verbosity/length bias | ~15% score inflation | Judges systematically prefer longer responses even when length adds no value. A 200-word answer that says nothing useful often scores higher than a crisp 50-word correct answer. | Normalize for length; add "brevity is preferred if the answer is complete" to the judge prompt. |
| Sycophancy | 5–10% | Confident-sounding incorrect answers score higher than hedged correct ones. | Use neutral framing; validate judge against human-labeled examples. |
| Self-preference bias | 5–10% | A model used as judge tends to favor outputs from its own model family. GPT-4 judging favors GPT-4 outputs; Claude judging favors Claude outputs. | Never use a model to judge its own outputs; use cross-model judging or ensembles. |
| Formatting bias | Variable | Models rate well-formatted responses (headers, bullets) higher regardless of content quality. | Strip formatting before evaluation, or evaluate formatting as a separate metric. |
| Cultural bias | High | Responses with Western naming patterns and familiar cultural references are rated higher. | Acknowledge this limit explicitly; use diverse human judges for high-stakes evaluations. |
A regression analysis across 150,000+ evaluation instances found that answer quality gap is the dominant driver of position bias susceptibility: when two responses are of similar quality (win rate ≈ 0.5), position-induced flipping is most likely.
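The position-bias mitigation in the table (run both orderings) can be wrapped around any pairwise judge. The sketch below assumes the judge is a callable that returns `"first"` or `"second"`; the `Judge` type and `debiased_compare` are hypothetical names, and the choice to report disagreement as `"inconsistent"` rather than averaging it into a half-win is a design decision, not a standard.

```python
from typing import Callable

# A judge takes (question, first_answer, second_answer) and returns "first" or "second".
Judge = Callable[[str, str, str], str]


def debiased_compare(judge: Judge, question: str, a: str, b: str) -> str:
    """Run the judge in both orders; return 'a', 'b', or 'inconsistent'."""
    forward = judge(question, a, b)   # a is shown first
    backward = judge(question, b, a)  # b is shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "inconsistent"  # the verdict flipped with position alone


def always_first(question: str, x: str, y: str) -> str:
    """Toy judge with total position bias, used only to demonstrate the wrapper."""
    return "first"


verdict = debiased_compare(always_first, "Which answer is better?", "Answer A", "Answer B")
```

The share of `"inconsistent"` verdicts over a dataset is itself a useful metric: it is a direct estimate of how much of your judge's signal is position noise.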
Real-world validation
Monte Carlo Data's Monitoring Agent had an LLM-as-judge prompt adherence monitor in place. The judge successfully caught a real reliability incident where the agent's outputs stopped following format instructions—demonstrating that while individual judge scores are noisy, anomaly detection over aggregated scores over time provides reliable signal.
3.2 Deterministic Checks
Deterministic checks apply fixed, rule-based assertions to agent outputs. They are fast, cheap, perfectly reproducible, and cost nothing in LLM tokens. They cannot evaluate semantic quality but are essential for catching format failures, schema violations, and known error patterns.
Promptfoo's deterministic assertion library covers a comprehensive range:
| Assertion | What It Checks |
|---|---|
| `contains` | Output contains a specific substring |
| `contains-all` | Output contains all items from a list |
| `is-json` | Output is valid JSON (optionally with schema validation) |
| `is-valid-openai-tools-call` | All tool calls match tool JSON schemas |
| `tool-call-f1` | F1 score comparing actual vs. expected tool calls |
| `trajectory:tool-used` | Agent used specific tools in traced execution |
| `trajectory:tool-sequence` | Tools called in expected order |
| `trajectory:step-count` | Number of trajectory steps within expected range |
| `latency` | Latency below threshold (milliseconds) |
| `cost` | Inference cost below specified threshold |
| `regex` | Output matches a regular expression |
Tool call F1 assertion example:
```yaml
tests:
  - vars:
      query: "What's the weather in NYC and book me a flight to LA?"
    assert:
      # Require exact match (F1 = 1.0)
      - type: tool-call-f1
        value:
          - get_weather
          - book_flight
      # Or allow partial matches with 80% threshold
      - type: tool-call-f1
        value: ['get_weather', 'book_flight']
        threshold: 0.8
```
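To see what a threshold of 0.8 tolerates, it helps to compute the score by hand. The function below is a plain set-based F1 over tool names, written for illustration; it is an assumption about the general technique and not Promptfoo's implementation, which may treat duplicates or empty inputs differently.

```python
def tool_call_f1(actual: list[str], expected: list[str]) -> float:
    """F1 over the sets of tool names (order and call counts are ignored)."""
    actual_set, expected_set = set(actual), set(expected)
    overlap = len(actual_set & expected_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(actual_set)   # share of called tools that were expected
    recall = overlap / len(expected_set)    # share of expected tools that were called
    return 2 * precision * recall / (precision + recall)


expected = ["get_weather", "book_flight"]
exact = tool_call_f1(["get_weather", "book_flight"], expected)
one_missing = tool_call_f1(["get_weather"], expected)
one_extra = tool_call_f1(["get_weather", "book_flight", "send_email"], expected)
```

Under this definition, one extra tool call on a two-tool reference lands at the 0.8 threshold, while one missing call falls below it.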
The testing pyramid places deterministic checks at the base:
- Layer 1 (Unit tests): pytest on deterministic logic—guardrails, auth helpers, retrieval filters, formatting. Runs in milliseconds, zero LLM cost.
- Layer 2 (Integration): Assert on schema compliance, required fields, prohibited content patterns.
- Layers 3–6 (Eval suites, adversarial, trajectory regression, production monitoring): Progressively more expensive and more realistic.
Key insight
Unit tests cannot assess LLM output quality. Eval suites cannot verify that deterministic policy logic is enforced in code. You need both layers—they are not substitutes for each other.
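A Layer 2 check of the kind described above can be a single pure function that a pytest test calls on recorded agent outputs. This is a sketch under assumed conventions: the required fields, the prohibited pattern, and the failure labels are all hypothetical.

```python
import json
import re

REQUIRED_FIELDS = {"answer", "sources"}            # hypothetical output schema
PROHIBITED = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # e.g. strings shaped like a US SSN


def check_output(raw: str) -> list[str]:
    """Return the names of failed checks for one agent output; empty means pass."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["is-json"]
    if not isinstance(data, dict):
        return ["is-object"]
    failures = []
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        failures.append(f"missing-fields:{','.join(sorted(missing))}")
    if PROHIBITED.search(raw):
        failures.append("prohibited-pattern")
    return failures
```

Because the function is deterministic and free to run, it can gate every commit, leaving the LLM-based eval suite for less frequent runs.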
3.3 Trajectory Evaluation
Trajectory evaluation compares the agent's actual action sequence to a reference trajectory, enabling diagnosis of where in the reasoning process a failure occurred—not just whether the final answer is correct.
LangChain's agentevals package implements four trajectory match modes:
| Mode | Description | Use Case |
|---|---|---|
| `strict` | Exact match of messages and tool calls in same order | Testing specific sequences (e.g., policy lookup before authorization) |
| `unordered` | Same tool calls allowed in any order | Verifying information retrieval when order doesn't matter |
| `subset` | Agent calls only tools from reference (no extras) | Ensuring agent doesn't exceed expected scope (efficiency check) |
| `superset` | Agent calls at least reference tools (extras allowed) | Verifying minimum required actions are taken |
Strict trajectory match example:
```python
from agentevals.trajectory.match import create_trajectory_match_evaluator
from langchain_core.messages import HumanMessage, AIMessage, ToolMessage

evaluator = create_trajectory_match_evaluator(trajectory_match_mode="strict")

# The reference trajectory the agent is expected to reproduce exactly.
reference_trajectory = [
    HumanMessage(content="What's the weather in San Francisco?"),
    AIMessage(content="", tool_calls=[
        {"id": "call_1", "name": "get_weather", "args": {"city": "San Francisco"}}
    ]),
    ToolMessage(content="It's 75 degrees and sunny.", tool_call_id="call_1"),
    AIMessage(content="The weather in San Francisco is 75 degrees and sunny."),
]

# `result` is the output of the agent run under test; it is not defined in this snippet.
evaluation = evaluator(
    outputs=result["messages"],
    reference_outputs=reference_trajectory,
)
# Returns: {'key': 'trajectory_strict_match', 'score': True, 'comment': None}
```
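The difference between the four modes is easiest to see on bare tool-name sequences. The function below is a simplified illustration of the mode semantics from the table; it compares tool names only, ignores arguments and message content, and is not how agentevals is implemented.

```python
from collections import Counter


def match_tools(actual: list[str], reference: list[str], mode: str) -> bool:
    """Compare tool-name sequences under one of the four match modes."""
    if mode == "strict":
        return actual == reference                      # same calls, same order
    if mode == "unordered":
        return Counter(actual) == Counter(reference)    # same calls, any order
    if mode == "subset":
        return set(actual) <= set(reference)            # no tools outside the reference
    if mode == "superset":
        return set(actual) >= set(reference)            # every reference tool was called
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical reference: the policy lookup must precede authorization.
reference = ["lookup_policy", "authorize"]
swapped = ["authorize", "lookup_policy"]
```

Here `swapped` passes `unordered` but fails `strict`, which is exactly the distinction needed when call order encodes a policy requirement.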
Agent-as-a-Judge is a newer paradigm from arXiv (2025) that uses an agent evaluator—one that can observe intermediate steps, use tools, and perform multi-step reasoning over the agent's action log. In code evaluation experiments:
- Agent-judge disagreed with human majority vote in only 0.3% of cases
- Standard LLM-judge (seeing only final output) disagreed 31% of the time
Langfuse describes trajectory evaluation as a "glass-box" method: when the final answer is wrong, trajectory evaluation pinpoints exactly where in the reasoning process the failure occurred.
A 2025 taxonomy from AssetOpsBench analyzed ~881 trajectories and found the most common failure modes:
- Overstatement of task completion: 122 cases (23.8%)—agent declares success prematurely
- Extraneous output formatting: 110 cases (21.4%)—outputs that are technically correct but malformed
- Ineffective error recovery: 160 cases—agent encounters an error but fails to adapt
3.4 Human Evaluation
Human evaluation is ground truth for any dimension involving judgment, nuance, or domain expertise. It is expensive and slow, which is why you calibrate automated evaluators against it rather than relying on it alone.
When to use it:
- Validate LLM-judge calibration against a held-out sample (5–10% spot-check is standard)
- Evaluate dimensions where no reliable automated proxy exists (e.g., "Does this explanation build appropriate intuition?")
- Label edge cases and hard failures that automated evals misclassify
- Establish initial ground-truth datasets for training automated judges
Rubric design principles (from Evidently AI's LLM-as-judge guide):
- Dimension-specific: Separate rubrics for accuracy, completeness, tone, and citation quality—never bundle them.
- Binary or low-scale: Yes/No or 1–3 scales reduce inter-annotator disagreement versus open 1–10 scales.
- Example-anchored: Provide 1–2 labeled examples per criterion (few-shot prompting for the human annotator).
- Independent: Each annotator evaluates independently before discussing disagreements.
Validation workflow:
- Manually label a diverse dataset of inputs and outputs as ground truth.
- Run LLM judge on the same dataset.
- Measure precision/recall of the LLM judge against human labels.
- Iteratively refine the judge prompt until LLM-human agreement is acceptable for your use case.
- Then scale automated evaluation with the validated judge.
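The measurement step in this workflow is ordinary binary-classification arithmetic. A minimal sketch, assuming both the human and the judge emit one boolean per example where `True` means "this output fails"; the function name and the label convention are illustrative.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Precision, recall, and raw agreement of judge labels against human labels.

    True means "fail", so recall is the share of human-flagged failures the judge catches.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)       # both flag a failure
    fp = sum(j and not h for h, j in pairs)   # judge flags, human does not
    fn = sum(h and not j for h, j in pairs)   # human flags, judge misses
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "agreement": sum(h == j for h, j in pairs) / len(pairs),
    }


# Five hypothetical examples labeled by a human and by the judge.
scores = judge_agreement(
    human=[True, True, False, False, True],
    judge=[True, False, False, True, True],
)
```

Track precision and recall separately while refining the prompt: a judge that misses failures and a judge that raises false alarms need different fixes.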
For inter-annotator agreement at scale: experiments across 12+ judge models and 150,000+ instances show strong consensus (≥2/3 judges agree) in >80% of cases, with full unanimity in ~23%. Hard cases—those with minimal answer quality gap—require human review by definition.
4. Benchmark Landscape
Public benchmarks provide standardized comparison points, but each has specific limitations. Use them for directional signal, not as proxies for production performance.
4.1 SWE-bench (and Variants)
What it measures: SWE-bench evaluates whether AI agents can resolve real-world GitHub issues by producing patches that pass hidden test suites. Originally Python-only, now extended via variants to other languages and modalities.
| Variant | Size | Description |
|---|---|---|
| SWE-bench Full | 2,294 tasks | Original full dataset |
| SWE-bench Verified | 500 tasks | Human-filtered; tasks manually validated as solvable |
| SWE-bench Lite | 300 tasks | Curated for less costly evaluation |
| SWE-bench Multilingual | 300 tasks across 9 languages | Extends beyond Python |
| SWE-bench Multimodal | 517 tasks | Issues with visual elements |
| SWE-bench Pro | 1,865 tasks | Harder variant from Scale AI; more languages, less contaminated |
Current top scores on SWE-bench Verified (as of March 2026, from LLM Stats):
- Claude Opus 4.5: 80.9% (Anthropic)
- Gemini 3 Flash: 78.0%
- GLM-5: 77.8%
- Kimi K2.5: 76.8%
Scaffolding inflates scores by 10%+
Scaffolding matters enormously. SWE-bench score analysis shows that different harnesses produce fluctuations of 10% or greater. Auggie CLI running Claude Opus 4.5 scores 51.80% on SWE-bench Pro vs. 45.89% for Claude Code using the same underlying model weights, a gap of about 6 percentage points (roughly 13% relative). Source: Augment Code. Always report scaffold and model separately.
The OpenAI discontinuation story: In February 2026, OpenAI formally stopped reporting SWE-bench Verified scores after an audit of a 138-problem subset revealed:
- 59.4% of audited tasks had material issues in test design or problem description
- 35.5% had narrow test cases—enforcing specific implementation details, invalidating functionally correct submissions
- 18.8% had wide test cases—checking functionality not specified in the problem description
- Contamination was confirmed across all frontier models: GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash Preview reproduced exact gold patches verbatim, including inline comments
Test overfitting rates (from the arXiv SWE-bench overfitting study): Claude-3.7-Sonnet shows a 21.8% test overfitting rate—code that passes generated surrogate tests but fails hidden gold tests. GPT-4o shows 33.0%. Adding a refinement loop increases overfitting from 25.5% to 35.9%.
Recommended replacement: SWE-bench Pro (Scale AI, late 2025)—harder tasks, less contamination, multi-language. When Pro launched, top agents dropped from 70%+ on Verified to ~23% on Pro; the gap has partially closed but Pro remains substantially harder. Also consider SWE-bench-Live, which adds 50 newly verified issues per month to keep the dataset fresh.
4.2 GAIA
What it measures: GAIA (General AI Assistants) evaluates agents on realistic multi-step tasks requiring reasoning, web browsing, tool use, multi-modal processing (PDFs, spreadsheets, audio, video), and code execution. Developed by Meta FAIR, Hugging Face, and AutoGPT collaborators.
Structure:
- 466 questions total; 165 in the public validation set
- 3 difficulty levels: Level 1 (solvable by strong LLMs), Level 2, Level 3 (large capability jump required)
- Most tasks require 1–3 tools and 3–12 steps
- Human baseline: 92% accuracy; each question took ~2 hours to construct
HAL leaderboard top results (verified, March 2026):
| Rank | Agent | Primary Model | Accuracy | Cost/run |
|---|---|---|---|---|
| 1 | HAL Generalist Agent (Pareto optimal) | Claude Sonnet 4.5 | 74.55% | $178.20 |
| 2 | HAL Generalist Agent | Claude Sonnet 4.5 High | 70.91% | $179.86 |
| 3 | HAL Generalist Agent | Claude Opus 4.1 High | 68.48% | $562.24 |
| 7 | HF Open Deep Research | GPT-5 Medium | 62.80% | $359.83 |
| 9 | HAL Generalist Agent (Pareto optimal) | o4-mini Low | 58.18% | $73.26 |
GAIA is approaching saturation at the top: 90% was achieved by top agents in late 2025 on the (less-verified) Hugging Face leaderboard. On the verified HAL leaderboard, the gap between the best agent (74.55%) and the human baseline (92%) is still roughly 17 points and concentrates in Level 3 questions. The cost column matters: runs range from $73 to $562, and the most expensive run is not the most accurate, so cost-per-correct-answer is a meaningful benchmark dimension.
4.3 HumanEval
What it measures: HumanEval (OpenAI, 2021) contains 164 Python programming problems. Each has a function signature, a docstring, and unit tests (an average of 7.7 per problem). The primary metric is pass@k.
Current state: Top models score 90%+ on HumanEval pass@1. The benchmark is essentially saturated for frontier models.
Why it does not represent agent capabilities: HumanEval measures single-turn code generation from a clean function signature. It tests none of the capabilities that make an agent useful in practice:
- No multi-file context or repository navigation
- No tool use (linters, test runners, dependency resolution)
- No multi-turn refinement (generate once, no feedback loop)
- No planning, decomposition, or clarification questions
Models clearing 90%+ on HumanEval can still completely fail at cross-file refactoring, flaky CI diagnosis, or any task requiring iterative refinement. Production signal from HumanEval is low.
Variants worth knowing:
- HumanEval+ (EvalPlus): 80× more test cases per problem, targeting edge cases the original tests miss
- HumanEval-X: Extends all 164 problems to Python, C++, Java, JavaScript, and Go
- HumanEvalComm: Introduces intentional ambiguity to test whether agents ask clarifying questions
Use HumanEval as a minimum capability floor, not as an agent benchmark. Test overfitting is documented here as well—models can achieve high pass@k by optimizing to pass the specific provided tests without generalizing. Source: arXiv overfitting study.
4.4 AgentBench
What it measures: AgentBench (ICLR 2024) is a comprehensive multi-environment benchmark for evaluating LLMs as agents in multi-turn, open-ended settings. It tests reasoning, decision-making, and instruction following across 8 grounded environments:
| Grounding | Environment | Task Type | Avg. Rounds to Solve |
|---|---|---|---|
| Code-grounded | Operating System (OS) | Bash command execution | 5–15 |
| Code-grounded | Database (DB) | SQL queries on real-world databases | 5–10 |
| Code-grounded | Knowledge Graph (KG) | SPARQL/reasoning over knowledge graphs | 10–20 |
| Game-grounded | Digital Card Game (DCG) | Multi-step strategy | 15–30 |
| Game-grounded | Lateral Thinking Puzzles (LTP) | Yes/no questions to solve logic puzzles | 10–25 |
| Game-grounded | House Holding (HH) | Household task planning (ALFWorld-style) | 10–20 |
| Web-grounded | Web Shopping (WS) | Navigate e-commerce to match product criteria | 5–15 |
| Web-grounded | Web Browsing (WB) | Complete tasks across web pages | 10–25 |
Key results (ICLR 2024 paper, 29 models tested):
| Model | Overall Score | Notes |
|---|---|---|
| GPT-4 (0613) | 4.01 | Best on 6/8 environments |
| Claude-3 Opus | 3.11 | Second overall |
| GPT-3.5-turbo | 2.32 | Fourth among API models |
| codellama-34b (OSS) | 0.96 | Best open-source ≤70B |
| Average OSS | ~0.51 | vs. 2.32 average for API models |
Main finding: Open-source LLMs ≤70B score approximately 5× worse than top commercial models on agent tasks, despite competitive performance on standard benchmarks. The bottlenecks are long-term reasoning, decision-making, and instruction following under multi-turn pressure.
The General AgentBench 2026 update introduces a unified framework spanning coding, search, reasoning, and tool-use under MCP. Key finding: agents experience substantial performance degradation when moving from domain-specific to general-agent settings—a clear robustness gap.
4.5 Other Notable Benchmarks
WebArena: Self-hosted web environments (e-commerce, GitLab, social forums, CMS, maps) where agents complete 812 realistic tasks. Human baseline: ~78%. Original GPT-4 agent (2023): 14.41%. Best current score: 71.6% (OpAgent by CodeFuse AI, 2025). Evaluation is purely programmatic—no LLM judge—making scores reproducible. Production signal: high; tasks involve real web complexity (unpredictable page loads, shifting elements).
ToolBench: Evaluates tool-use planning across 3,451 tools and 16,464 real APIs from RapidAPI, across single-tool and multi-tool scenarios. ToolEval uses GPT-4 as judge. Strong models exceed 70% pass rate with oracle retrieval; realistic retrieval degrades performance significantly. StableToolBench (2024) adds a virtual API server for reproducibility.
BFCL (Berkeley Function-Calling Leaderboard): v4 (2025) tests multi-turn tool use across Python, Java, JavaScript, and REST APIs—4,441 question-function-answer triplets—tracking cost and latency alongside accuracy. Best overall: 77.5% (Claude Opus 4.5). Production signal: high. Explicitly measures tool-use economics.
TAU-bench (τ-bench): Sierra's benchmark simulates real customer support where agents follow domain-specific policies while using tools and conversing with LLM-simulated users. Best models: <50% pass@1 on retail + airline; pass^8 drops below 25% for most top models. Critical caveat: a do-nothing agent passes 38% of τ-bench airline tasks due to substring matching in evaluation—a fundamental measurement bug that makes raw scores unreliable.
4.6 Goodhart's Law and Benchmark Gaming
"When a measure becomes a target, if it is effectively optimized, then the thing it is designed to measure will grow worse." — Jascha Sohl-Dickstein, Strong Goodhart's Law
Benchmark gaming is not a theoretical concern. It is documented, ongoing, and gets worse as stakes increase.
1. SWE-bench contamination (OpenAI, February 2026): Frontier models reproduced exact gold patches—including specific inline comments—for problems exposed during training. GPT-5.2 solved 31 "nearly impossible" tasks, indicating training exposure. OpenAI formally discontinued use of SWE-bench Verified as a result.
2. LMArena/Chatbot Arena gaming (Collinear AI analysis): Meta privately tested 27 model variants before the Llama-4 release, selectively publishing only the best-performing result. Researchers estimate that modest increases in Arena data access could boost a model's Arena score by up to 112%.
3. SWE-bench score discrepancies (Reddit analysis): DeepSeek V3.2 scores 60% on the official leaderboard (mini-SWE-agent scaffold) vs. 73.1% on its Hugging Face model page (OpenHands scaffold). The benchmark score is as much a function of scaffolding as the model.
4. RL reward hacking: Baker et al. (2025) found that RL training on SWE-bench causes models to discover the reward hack of disabling the test suite rather than fixing the code. The model learns to game the metric rather than solve the underlying task.
5. Meta FAIR call-out: Reddit discussion (September 2025) documents Meta FAIR's finding that models including Claude 4 Sonnet achieved high SWE-bench scores by locating existing bug fixes on GitHub and presenting them as independently developed solutions.
Mitigations practiced by responsible benchmark maintainers:
- Password-protected releases to prevent training contamination
- Canary strings in benchmark data to detect training leakage
- Private holdout sets for ongoing evaluation
- Continuous refresh (SWE-bench-Live: 50 new verified issues/month)
- Multi-run consistency metrics (pass^k) rather than single-shot scores
- Reporting scaffold separately from model weights
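As an illustration of the canary-string mitigation, the sketch below scans a collection of documents for a benchmark's canary. The canary value, the function name, and the sample corpus are all hypothetical; a real benchmark publishes its own unique string, and a real scan would run over training shards rather than an in-memory dict.

```python
# Hypothetical canary value, used here only for illustration.
CANARY = "BENCHMARK-CANARY-GUID 00000000-0000-0000-0000-000000000000"


def find_canary_leaks(documents: dict[str, str], canary: str = CANARY) -> list[str]:
    """Return the IDs of documents that contain the canary string."""
    # Normalize whitespace and case so reflowed or re-cased copies still match.
    needle = " ".join(canary.split()).lower()
    leaked = []
    for doc_id, text in documents.items():
        haystack = " ".join(text.split()).lower()
        if needle in haystack:
            leaked.append(doc_id)
    return leaked


corpus = {
    "clean_page": "An unrelated blog post about sorting algorithms.",
    "scraped_dump": f"task_001 ... {CANARY} ... expected patch follows",
}
leaks = find_canary_leaks(corpus)
print(leaks)
```

A hit means benchmark data reached the corpus. A miss proves little, since copies that stripped the canary go undetected.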
5. Tools and Frameworks¶
LangSmith¶
LangSmith is LangChain's tracing and evaluation platform. It logs every tool call, LLM call, and step in full, provides evaluation dataset management, human annotation queues, LLM-as-judge scoring, and monitoring dashboards.
Concrete use case: You build a research agent using LangChain. Every production trace is logged to LangSmith. You sample 10% of traces daily and run LLM-judge evaluators checking faithfulness and completeness. Failing traces go into an annotation queue for human review. You build a regression dataset from human-labeled failures and run it automatically on every prompt change via the agentevals package's create_trajectory_match_evaluator (strict, unordered, subset, superset modes).
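To make the four trajectory-matching modes concrete, here is a plain-Python sketch of one plausible reading of them, comparing tool names only. It illustrates the idea and is not the agentevals implementation, whose matching rules may differ in details; the function name and the sample trajectories are assumptions.

```python
from collections import Counter


def trajectory_matches(actual: list[str], reference: list[str], mode: str) -> bool:
    """Compare two tool-call sequences (tool names only) under one matching mode."""
    a, r = Counter(actual), Counter(reference)
    if mode == "strict":  # same calls in the same order
        return actual == reference
    if mode == "unordered":  # same calls, any order
        return a == r
    if mode == "subset":  # the agent made no calls beyond the reference
        return all(a[name] <= r[name] for name in a)
    if mode == "superset":  # the agent made at least every reference call
        return all(a[name] >= r[name] for name in r)
    raise ValueError(f"unknown mode: {mode}")


reference = ["search", "read_page", "summarize"]
actual = ["search", "search", "read_page", "summarize"]  # one extra search
for mode in ("strict", "unordered", "subset", "superset"):
    print(mode, trajectory_matches(actual, reference, mode))
```

In this example only the superset mode passes, which suits regression cases where extra exploratory calls are acceptable but skipping a required step is not.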
Pricing:
| Plan | Cost | Traces | Notes |
|---|---|---|---|
| Developer (free) | $0 | 5k base traces/month | 14-day retention; 1 seat |
| Plus | $39/seat/month | 10k base traces/month | Extended traces: $5.00/1k (400-day retention); $0.50/1k overage |
| Enterprise | Custom | Custom | Custom retention, RBAC, dedicated support |
Braintrust¶
Braintrust is an AI observability and evaluation platform focused on experiment comparison, regression testing, and CI integration. It supports LLM scoring, code scoring, and human scoring, with a one-click "trace to dataset" workflow for converting production traces into eval datasets.
Concrete use case: A team shipping a new version of their research agent runs both versions on a 200-question golden dataset. Braintrust produces per-question scores on faithfulness, citation accuracy, and completeness in a side-by-side comparison dashboard. Regressions automatically block the CI deploy. The MCP server allows IDE-integrated evaluation without leaving the editor.
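The regression gate in this workflow can be sketched independently of any platform. The example below assumes per-question scores have already been exported as plain lists; the function name, the 0.02 tolerance, and the sample scores are hypothetical and do not reflect Braintrust's API.

```python
def find_regressions(
    baseline: dict[str, list[float]],
    candidate: dict[str, list[float]],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return each metric whose mean dropped by more than `tolerance`, with the drop."""
    regressions = {}
    for metric, base_scores in baseline.items():
        cand_scores = candidate[metric]
        base_mean = sum(base_scores) / len(base_scores)
        cand_mean = sum(cand_scores) / len(cand_scores)
        drop = base_mean - cand_mean
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


# Hypothetical per-question scores for two agent versions.
baseline = {"faithfulness": [0.9, 1.0, 0.8], "citation_accuracy": [1.0, 1.0, 0.5]}
candidate = {"faithfulness": [0.9, 0.9, 0.9], "citation_accuracy": [0.5, 1.0, 0.5]}

regressions = find_regressions(baseline, candidate)
exit_code = 1 if regressions else 0  # a CI job would pass this to sys.exit()
print(regressions, exit_code)
```

Comparing means hides per-question flips, so a stricter gate might also count how many individual questions went from pass to fail.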
Pricing:
| Plan | Cost | Storage | Scores | Retention |
|---|---|---|---|---|
| Starter (free) | $0 | 1 GB/month | 10k/month | 14 days |
| Pro | $249/month | 5 GB/month | 50k/month | 30 days |
| Enterprise | Custom | Custom | Custom | Custom |
Weights & Biases Weave¶
W&B Weave is W&B's LLM observability product (separate from traditional ML experiment tracking). It provides hierarchical trace trees for multi-agent pipelines, evaluation frameworks with configurable scorers, and automatic cost calculation per LLM call by tracking token usage.
Concrete use case: You build a multi-agent pipeline where Agent A searches the web, Agent B reads PDFs, and Agent C synthesizes. Weave captures the full hierarchical trace—handoffs between agents, latency per step, cost per tool call, and hallucination scores via LLM-as-judge. You compare two pipeline versions on a shared evaluation leaderboard showing quality vs. cost trade-offs.
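Cost tracking from token usage reduces to simple arithmetic once token counts are captured per call. The sketch below shows that calculation with a hypothetical price table and hypothetical agent and model names; it does not use Weave's API, and the prices are not real vendor pricing.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens: (input, output).
PRICES_PER_MTOK = {"model-small": (0.50, 1.50), "model-large": (5.00, 15.00)}


@dataclass
class LLMCall:
    agent: str
    model: str
    input_tokens: int
    output_tokens: int


def call_cost_usd(call: LLMCall) -> float:
    """Cost of one call: tokens times the per-million-token price for its model."""
    in_price, out_price = PRICES_PER_MTOK[call.model]
    return (call.input_tokens * in_price + call.output_tokens * out_price) / 1_000_000


def cost_by_agent(calls: list[LLMCall]) -> dict[str, float]:
    """Sum call costs per agent so expensive pipeline stages stand out."""
    totals: dict[str, float] = {}
    for call in calls:
        totals[call.agent] = totals.get(call.agent, 0.0) + call_cost_usd(call)
    return totals


trace = [
    LLMCall("searcher", "model-small", 2_000, 400),
    LLMCall("reader", "model-small", 10_000, 1_000),
    LLMCall("synthesizer", "model-large", 6_000, 2_000),
]
totals = cost_by_agent(trace)
print(totals)
```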
Pricing (from W&B pricing page):
| Plan | Cost | Weave Data Ingestion | Storage |
|---|---|---|---|
| Free | $0 | 1 GB/month | 5 GB/month |
| Pro | From $60/month | 1.5 GB/month (+$0.10/MB) | 100 GB/month |
| Enterprise | Custom | Custom | Custom (HIPAA, SOC2, SSO/SAML) |
OpenAI Evals¶
OpenAI Evals is OpenAI's framework and API for testing model outputs against defined criteria. It supports automated eval runs, multiple grading strategies (exact string match, LLM-as-judge, human review), and programmatic configuration. The January 2026 "Testing Agent Skills Systematically" guide introduced a skills-based eval framework: turning agent capabilities into testable, scorable units that check both tool invocation and output conventions.
Concrete use case: You want to evaluate whether a model correctly categorizes IT support tickets. You upload a dataset of tickets with human-labeled categories, configure an eval using a string-check grader (exact match against the label), and run it against multiple models. The API returns per-model pass/fail rates and cost breakdowns. Pricing: standard OpenAI API token costs for any LLM-as-judge grading calls; no separate platform fee.
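The grading logic behind an exact-match eval like this is small enough to sketch locally. The example below is a stand-alone illustration with hypothetical model names, labels, and outputs; it does not call the OpenAI Evals API, and trimming and case-folding before comparison is an assumption about how lenient the match should be.

```python
def exact_match_pass_rate(predictions: list[str], labels: list[str]) -> float:
    """Fraction of predictions equal to the human label after trimming and case-folding."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must be the same length")
    hits = sum(
        pred.strip().casefold() == label.strip().casefold()
        for pred, label in zip(predictions, labels)
    )
    return hits / len(labels)


# Hypothetical human labels and model outputs for four tickets.
labels = ["hardware", "network", "access", "network"]
outputs_by_model = {
    "model-a": ["Hardware", "network", "access", "software"],
    "model-b": ["hardware", "network ", "access", "network"],
}
pass_rates = {
    model: exact_match_pass_rate(outputs, labels)
    for model, outputs in outputs_by_model.items()
}
print(pass_rates)
```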
Inspect AI¶
Inspect AI is an open-source evaluation framework from the UK AI Security Institute (AISI). It covers coding, agentic tasks, reasoning, knowledge, behavior, and multi-modal understanding. Key components: datasets, solvers (prompt engineering, CoT, self-critique, agent scaffolds), and scorers (text comparison, model-graded, custom).
Concrete use case: You want to run 100+ pre-built evaluations (including popular benchmarks like ARC, MMLU, and theory-of-mind tests) with a single command, or benchmark your agent in a sandboxed Docker environment:
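```python
from inspect_ai import Task, task
from inspect_ai.dataset import example_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import chain_of_thought, generate, self_critique


@task
def theory_of_mind():
    return Task(
        dataset=example_dataset("theory_of_mind"),
        solver=[chain_of_thought(), generate(), self_critique()],
        scorer=model_graded_fact(),
    )
```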
Inspect AI also supports arbitrary external agents (Claude Code, Codex CLI, Gemini CLI), sandboxed code execution via Docker/Kubernetes/Modal, a VS Code extension for authoring and debugging, and custom MCP tools. Pricing: free and open-source.
AgentOps¶
AgentOps is a developer platform for monitoring, testing, and debugging AI agents in production. It provides session replay analytics with step-by-step "time-travel" debugging, LLM cost tracking across 400+ LLMs, real-time performance monitoring, and infinite loop and PII leak detection. It integrates natively with CrewAI, AutoGen, and LangChain.
Concrete use case: Your agent exhibits unexpected behavior in production on 0.3% of sessions. You open AgentOps, replay the specific session step-by-step, and identify that the agent made an incorrect tool selection on step 4 due to ambiguous retrieval output, causing a cascading failure. You add a deterministic assertion for that tool selection pattern to your test suite. Pricing: free up to 5,000 events/month; $40/month for unlimited events.
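One deterministic assertion worth adding after an incident like this is a loop detector over the recorded tool calls. The sketch below is a generic illustration, not AgentOps functionality: the function name, the `name`/`args` record shape, and the threshold of three repeats are assumptions.

```python
import json


def detect_tool_loop(tool_calls: list[dict], max_repeats: int = 3) -> bool:
    """Return True if the same (name, args) call occurs more than max_repeats times in a row."""
    run_length = 0
    previous = None
    for call in tool_calls:
        # Serialize args with sorted keys so dict key order does not affect equality.
        signature = (call["name"], json.dumps(call.get("args", {}), sort_keys=True))
        run_length = run_length + 1 if signature == previous else 1
        if run_length > max_repeats:
            return True
        previous = signature
    return False


healthy = [
    {"name": "search", "args": {"q": "agent evals"}},
    {"name": "read", "args": {"url": "https://example.com"}},
]
stuck = [{"name": "search", "args": {"q": "agent evals"}}] * 4

print(detect_tool_loop(healthy), detect_tool_loop(stuck))
```

This catches only exact consecutive repeats. Loops that alternate between two tools or vary an argument slightly would need a windowed or fuzzy comparison.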
Comparison Table¶
| Tool | Best For | Pricing Start | Agent-Specific Features |
|---|---|---|---|
| LangSmith | LangChain-based agent tracing + human feedback | Free / $39/seat | Trajectory evaluators, annotation queues |
| Braintrust | Regression testing + CI integration | Free / $249/month | MCP server, experiment leaderboards |
| W&B Weave | Multi-agent tracing + cost tracking | Free / $60/month | Hierarchical traces, cost per call |
| OpenAI Evals | Model output testing, OpenAI ecosystem | Free (API tokens) | Skills-based eval framework |
| Inspect AI | Academic/safety evaluations | Free (open-source) | 100+ pre-built evals, agent sandboxing |
| AgentOps | Production debugging + session replay | Free / $40/month | Time-travel debugging, 400+ LLM cost tracking |
6. A Minimal Eval Setup¶
A research agent retrieves information, reasons over multiple sources, and synthesizes a final answer with citations. The following 3-part eval setup catches the most common failure modes at minimal cost.
Run in order
These layers are ordered by cost: deterministic checks are free, citation verification is cheap (HTTP requests only), and the LLM judge is the most expensive. Fail fast on the cheap layers before running LLM-based evaluation.
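The fail-fast ordering can be expressed as a small runner that stops at the first failing layer. This is a sketch with stand-in checks; the `run_layers` name, the layer names, and the lambda checks are hypothetical placeholders for the three parts described in this section.

```python
from typing import Callable

Layer = tuple[str, Callable[[dict], bool]]


def run_layers(agent_output: dict, layers: list[Layer]) -> dict:
    """Run layers in order and stop at the first failure, skipping costlier layers."""
    executed = []
    for name, check in layers:
        executed.append(name)
        if not check(agent_output):
            return {"passed": False, "failed_layer": name, "executed": executed}
    return {"passed": True, "failed_layer": None, "executed": executed}


# Stand-in checks, ordered from cheapest to most expensive.
layers: list[Layer] = [
    ("deterministic", lambda out: bool(out.get("answer", "").strip())),
    ("citations", lambda out: len(out.get("citations", [])) > 0),
    ("llm_judge", lambda out: True),  # placeholder for the expensive judge call
]

report = run_layers({"answer": "Some answer", "citations": []}, layers)
print(report)
```

Because the citation layer fails here, the judge layer never runs and incurs no cost.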
Part 1: Deterministic Pre-checks¶
Run before any LLM-based eval. Fast, cheap, zero LLM cost. Catches obvious failures immediately.
```python
import re


def run_deterministic_checks(agent_output: dict) -> dict:
    """
    agent_output expected keys:
    - answer: str
    - citations: list[dict] with 'url' and 'text' keys
    - tool_calls: list[dict] with 'name' and 'args' keys
    - latency_ms: int
    - total_cost_usd: float
    """
    results = {}
    answer = agent_output.get("answer", "")

    # Check 1: Answer is non-empty
    results["answer_non_empty"] = len(answer.strip()) > 0

    # Check 2: Citations present
    citations = agent_output.get("citations", [])
    results["has_citations"] = len(citations) >= 1

    # Check 3: All citations have valid URL format.
    # all() over an empty list is True, so Check 2 covers the no-citation case.
    url_pattern = re.compile(r"^https?://.+")
    results["citation_urls_valid"] = all(
        url_pattern.match(c.get("url", "")) is not None for c in citations
    )

    # Check 4: Required tools were called (search + read for research agents)
    tool_names = [tc.get("name", "") for tc in agent_output.get("tool_calls", [])]
    results["search_tool_called"] = "search" in tool_names
    results["read_tool_called"] = any("read" in t for t in tool_names)

    # Check 5: Latency within threshold (60 s for research tasks).
    # A missing value falls back to a sentinel that fails the check.
    results["latency_ok"] = agent_output.get("latency_ms", 999_999) < 60_000

    # Check 6: Cost within threshold ($0.50 per task)
    results["cost_ok"] = agent_output.get("total_cost_usd", 999) < 0.50

    # Check 7: Answer length reasonable (not truncated, not padded)
    results["length_reasonable"] = 50 < len(answer) < 5_000

    # Computed last so it aggregates only the individual checks above
    results["all_passed"] = all(results.values())
    return results
```
Part 2: Citation Verification¶
For each citation in the agent's output, verify the URL exists (HTTP 200) and that the cited text is actually present at that URL. No LLM required.
```python
import requests


def verify_citations(citations: list[dict]) -> list[dict]:
    """
    For each citation, verify:
    - URL resolves (HTTP 200)
    - Cited snippet appears in fetched content
    """
    results = []
    for cite in citations:
        url = cite.get("url", "")
        cited_text = cite.get("text", "")
        result = {"url": url, "cited_text_preview": cited_text[:100]}
        try:
            resp = requests.get(
                url, timeout=10, headers={"User-Agent": "EvalBot/1.0"}
            )
            result["url_resolves"] = resp.status_code == 200

            # Normalize whitespace before substring search
            page_content = " ".join(resp.text.split())
            normalized_cited = " ".join(cited_text.split())
            result["text_found_in_page"] = (
                len(normalized_cited) > 10  # non-trivial text
                and normalized_cited.lower() in page_content.lower()
            )
        except Exception as e:
            # Network errors, timeouts, and malformed URLs all count as failures
            result["url_resolves"] = False
            result["text_found_in_page"] = False
            result["error"] = str(e)
        results.append(result)
    return results
```
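One limitation of matching against the raw response body is that inline markup can split a quoted sentence, so a genuine citation may fail the substring test. A possible refinement, sketched here with the standard library only, is to extract visible text before matching. The `html_to_text` helper and its extractor class are hypothetical additions, not part of the setup above.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Return whitespace-normalized visible text from an HTML document."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining chunks with a space keeps adjacent block elements from fusing.
    return " ".join(" ".join(parser.chunks).split())


page = "<p>Agents fail <b>more often</b> on long tasks.</p><script>var x = 1;</script>"
text = html_to_text(page)
print(text)
```

Joining chunks with a space can split a word that is only partly wrapped in an inline tag, so this trades one class of false negatives for a much rarer one.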
Part 3: LLM-as-Judge Faithfulness Check¶
Based on Datadog's hallucination detection approach: frame the task as finding disagreements between context and answer, not confirming agreements. This prevents the judge from taking the easy path of affirming everything.
```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FAITHFULNESS_JUDGE_PROMPT = """You are an expert evaluator assessing whether an agent's answer
is faithfully grounded in the provided source documents.

## Source Documents (Ground Truth)
{context}

## Agent's Answer
{answer}

## Citations Used
{citations}

## Evaluation Instructions
Your task is to identify DISAGREEMENTS between the answer and the source documents.
For each claim in the answer:
1. Find the relevant portion of the source documents
2. Determine if the claim is supported, unsupported, or contradicted

Classify each issue as:
- CONTRADICTION: The answer directly contradicts the source
- UNSUPPORTED: The answer makes claims not present in the sources (hallucination)
- AGREEMENT: The claim is supported by sources

## Output Format (JSON)
{{
  "issues": [
    {{
      "claim": "exact quote from answer",
      "type": "CONTRADICTION | UNSUPPORTED | AGREEMENT",
      "source_quote": "relevant quote from source or null if unsupported",
      "explanation": "brief explanation"
    }}
  ],
  "faithfulness_score": 0.0,
  "verdict": "FAITHFUL | PARTIALLY_FAITHFUL | UNFAITHFUL"
}}

Notes:
- faithfulness_score = fraction of claims classified as AGREEMENT (0.0–1.0)
- FAITHFUL = score >= 0.9, PARTIALLY_FAITHFUL = 0.6–0.9, UNFAITHFUL = < 0.6
"""


def evaluate_faithfulness(
    answer: str,
    citations: list[dict],
    judge_model: str = "gpt-4o",
) -> dict:
    # Prefer the full source text when available; fall back to the cited snippet
    context = "\n\n---\n\n".join(
        f"Source: {c['url']}\n{c.get('full_text', c.get('text', ''))}"
        for c in citations
    )
    citations_summary = "\n".join(
        f"- {c['url']}: '{c.get('text', '')[:200]}...'"
        for c in citations
    )
    response = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "system", "content": "You are a meticulous fact-checking evaluator."},
            {
                "role": "user",
                "content": FAITHFULNESS_JUDGE_PROMPT.format(
                    context=context,
                    answer=answer,
                    citations=citations_summary,
                ),
            },
        ],
        response_format={"type": "json_object"},
        temperature=0.0,  # minimizes sampling variance; not a strict determinism guarantee
    )
    return json.loads(response.choices[0].message.content)
```
Key design choices (from Datadog's research):
- Frame as finding disagreements—prevents the judge from confirming everything
- Require quotes from both the answer and source for each claim—forces grounding
- temperature=0.0 for consistency across runs
- Structured JSON output for deterministic parsing of the verdict
Production Thresholds¶
Calibrated from Let's Data Science analysis and Monte Carlo Data's production experience:
| Metric | Alert Threshold | Block / Escalate Threshold |
|---|---|---|
| Faithfulness score | < 0.8 | < 0.6 |
| Quality overall (1–5) | < 4.0 | < 3.0 |
| Citation URL validity | < 90% | < 75% |
| Citation text found in page | < 80% | < 60% |
| Latency | > 45 s | > 60 s |
| Cost per task | > $0.40 | > $0.75 |
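The table translates directly into a lookup that maps each metric to an "ok", "alert", or "block" status. The sketch below encodes the thresholds above; the metric key names, the function names, and the rule that the worst status wins are assumptions made for illustration.

```python
# (alert threshold, block threshold, direction in which values get worse)
THRESHOLDS = {
    "faithfulness": (0.8, 0.6, "low"),
    "quality_overall": (4.0, 3.0, "low"),
    "citation_url_validity": (0.90, 0.75, "low"),
    "citation_text_found": (0.80, 0.60, "low"),
    "latency_s": (45.0, 60.0, "high"),
    "cost_usd": (0.40, 0.75, "high"),
}
SEVERITY = {"ok": 0, "alert": 1, "block": 2}


def classify(metric: str, value: float) -> str:
    """Map one metric value to 'ok', 'alert', or 'block'."""
    alert, block, bad_direction = THRESHOLDS[metric]
    if bad_direction == "low":
        breached_block, breached_alert = value < block, value < alert
    else:
        breached_block, breached_alert = value > block, value > alert
    if breached_block:
        return "block"
    return "alert" if breached_alert else "ok"


def overall_status(metrics: dict[str, float]) -> tuple[str, dict[str, str]]:
    """Return the worst status across metrics plus the per-metric breakdown."""
    statuses = {name: classify(name, value) for name, value in metrics.items()}
    worst = max(statuses.values(), key=SEVERITY.__getitem__, default="ok")
    return worst, statuses


run = {"faithfulness": 0.72, "latency_s": 30.0, "cost_usd": 0.80}
worst, statuses = overall_status(run)
print(worst, statuses)
```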
Calibrate before deploying thresholds
These thresholds are starting points—not universal truths. Evidently AI recommends measuring precision/recall of your LLM judge against a human-labeled held-out set and iterating until agreement is satisfactory before trusting thresholds at scale.
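Measuring judge quality against human labels needs only a confusion matrix. The following sketch treats "unfaithful" as the positive class and compares judge verdicts with human verdicts on a held-out set; the function name and the sample labels are hypothetical.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Precision, recall, and accuracy of the judge at flagging unfaithful answers.

    True means 'unfaithful' in both lists; human labels are treated as ground truth.
    """
    if len(human) != len(judge):
        raise ValueError("label lists must be the same length")
    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)        # judge flagged, human agrees
    fp = sum((not h) and j for h, j in pairs)  # judge flagged a faithful answer
    fn = sum(h and (not j) for h, j in pairs)  # judge missed an unfaithful answer
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": sum(h == j for h, j in pairs) / len(pairs),
    }


# Hypothetical verdicts on eight held-out answers.
human = [True, True, True, False, False, False, False, True]
judge = [True, True, False, False, True, False, False, True]
metrics = judge_agreement(human, judge)
print(metrics)
```

Low recall means unfaithful answers slip through the judge, while low precision means the human review queue fills with false alarms. Which one to prioritize depends on the cost of each error.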
Use trend monitoring, not point-in-time thresholds
Use anomaly detection on score trends over time rather than single-run thresholds. Dropbox tracks evaluation score trends at hourly, 6-hour, and daily intervals. Monte Carlo Data used this approach to catch a real reliability incident before users noticed.
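A simple form of trend-based detection is a rolling z-score: flag a score when it deviates sharply from the recent window. This is a minimal sketch under assumed parameters (a 6-point window and a 3-sigma threshold) with made-up hourly scores; it is not how Dropbox or Monte Carlo Data implement their monitoring.

```python
from statistics import mean, stdev


def find_anomalies(
    scores: list[float], window: int = 6, z_threshold: float = 3.0
) -> list[int]:
    """Return indices whose score deviates from the trailing window by more than z_threshold sigmas."""
    anomalies = []
    for i in range(window, len(scores)):
        history = scores[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # A perfectly flat history makes any change anomalous.
            if scores[i] != mu:
                anomalies.append(i)
            continue
        if abs(scores[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical hourly faithfulness scores with one sharp drop at index 7.
hourly_faithfulness = [0.91, 0.92, 0.90, 0.93, 0.91, 0.92, 0.91, 0.74, 0.92]
anomalies = find_anomalies(hourly_faithfulness)
print(anomalies)
```

Note that an anomalous point stays in the trailing window afterwards and inflates the standard deviation, so a sustained degradation is flagged at onset but may not be flagged again. A production monitor would typically exclude flagged points from the baseline.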