
Evaluation & Metrics

How do you know if your agent is actually working? Evaluation is the discipline of measuring agent performance across multiple dimensions: component accuracy, task completion, and system-level metrics. This guide covers evaluation taxonomy, agent-specific metrics, and major benchmarks.

Evaluation Taxonomy

┌─────────────────────────────────────────────────────────────────────────────┐
│                          Agent Evaluation Layers                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  Layer 3: SYSTEM                                                     │   │
│  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐    │   │
│  │  │ E2E Latency │ │ Cost/Task   │ │ Safety      │ │ User Pref   │    │   │
│  │  │ P95 < 30s   │ │ $/query     │ │ Compliance  │ │ Win Rate    │    │   │
│  │  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘    │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                    ↑                                        │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  Layer 2: TASK                                                       │   │
│  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐    │   │
│  │  │ Completion  │ │ Step        │ │ Error       │ │ Output      │    │   │
│  │  │ Rate        │ │ Efficiency  │ │ Recovery    │ │ Quality     │    │   │
│  │  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘    │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                    ↑                                        │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  Layer 1: COMPONENT                                                  │   │
│  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐    │   │
│  │  │ Tool Call   │ │ Argument    │ │ Response    │ │ Context     │    │   │
│  │  │ Accuracy    │ │ Extraction  │ │ Format      │ │ Utilization │    │   │
│  │  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘    │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Evaluation flows bottom-up: Strong component metrics are necessary but not
sufficient for good task metrics, which are necessary but not sufficient
for good system metrics.

Agent evaluation operates at three layers. Component-level metrics measure individual capabilities (tool calling, parsing). Task-level metrics measure goal achievement. System-level metrics measure real-world deployment concerns (latency, cost, safety).

Evaluation Metrics Reference

Layer     | Metric                   | Description                   | Evaluation Method
----------+--------------------------+-------------------------------+--------------------------------------
Component | Tool Calling Accuracy    | Correct tool selection rate   | Compare to ground truth tool sequence
Component | Argument Extraction F1   | Parameter parsing accuracy    | Compare extracted args to expected
Component | Response Format Validity | Structured output correctness | JSON schema validation
Task      | Task Completion Rate     | Goal achievement percentage   | Binary pass/fail per task
Task      | Step Efficiency          | Steps taken vs optimal path   | Ratio of actual to optimal steps
Task      | Error Recovery Rate      | Recovery from failures        | Track retry success rate
System    | End-to-End Latency       | Total response time           | Measure P50, P95, P99
System    | Cost per Task            | Resource consumption          | Tokens/API calls/dollars
System    | Safety Compliance        | Guardrail adherence           | Red team testing

Evaluation metrics organized by layer
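
As a concrete component-level check, the sketch below computes response format validity by validating each raw output against a JSON schema. It is a minimal sketch assuming the jsonschema package; the schema and outputs are illustrative, not tied to any particular agent framework.

# Response format validity: fraction of outputs that are valid JSON and
# conform to an expected schema. Schema and outputs are illustrative only.
import json
from jsonschema import ValidationError, validate

expected_schema = {
    "type": "object",
    "properties": {
        "tool": {"type": "string"},
        "arguments": {"type": "object"},
    },
    "required": ["tool", "arguments"],
}

def response_format_validity(raw_outputs: list, schema: dict) -> float:
    """Fraction of outputs that parse as JSON and satisfy the schema."""
    valid = 0
    for raw in raw_outputs:
        try:
            validate(instance=json.loads(raw), schema=schema)
            valid += 1
        except (json.JSONDecodeError, ValidationError):
            pass
    return valid / len(raw_outputs) if raw_outputs else 0.0

# One well-formed structured output, one truncated one -> 0.5
outputs = [
    '{"tool": "get_weather", "arguments": {"city": "London"}}',
    '{"tool": "get_weather", "arguments": ',
]
print(response_format_validity(outputs, expected_schema))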

Evaluation Taxonomy Implementation
// Agent Evaluation Taxonomy
// Three-layer evaluation approach

Layer 1: Component Evaluation
├── Tool Calling Accuracy    // Does the agent select correct tools?
├── Argument Extraction      // Are parameters correctly parsed?
├── Response Formatting      // Is output properly structured?
└── Context Utilization      // Does it use available context well?

Layer 2: Task Evaluation
├── Task Completion Rate     // Did the agent achieve the goal?
├── Step Efficiency          // How many steps vs optimal?
├── Error Recovery           // Did it handle failures gracefully?
└── Output Quality           // Is the result correct and complete?

Layer 3: System Evaluation
├── End-to-End Latency       // Total time from request to result
├── Cost Efficiency          // Tokens/API calls per task
├── Safety Compliance        // Did it stay within guardrails?
└── User Satisfaction        // Human preference ratings
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

class EvaluationLayer(Enum):
    COMPONENT = "component"
    TASK = "task"
    SYSTEM = "system"

@dataclass
class Metric:
    name: str
    layer: EvaluationLayer
    description: str
    higher_is_better: bool = True

# Define evaluation taxonomy
EVALUATION_TAXONOMY = [
    # Layer 1: Component Evaluation
    Metric("tool_calling_accuracy", EvaluationLayer.COMPONENT,
           "Percentage of correct tool selections", True),
    Metric("argument_extraction_f1", EvaluationLayer.COMPONENT,
           "F1 score for parameter extraction", True),
    Metric("response_format_validity", EvaluationLayer.COMPONENT,
           "Percentage of valid JSON/structured outputs", True),
    Metric("context_utilization", EvaluationLayer.COMPONENT,
           "Relevant context retrieval rate", True),

    # Layer 2: Task Evaluation
    Metric("task_completion_rate", EvaluationLayer.TASK,
           "Percentage of successfully completed tasks", True),
    Metric("step_efficiency", EvaluationLayer.TASK,
           "Actual steps / optimal steps ratio", False),
    Metric("error_recovery_rate", EvaluationLayer.TASK,
           "Successful recoveries from errors", True),
    Metric("output_correctness", EvaluationLayer.TASK,
           "Factual accuracy of generated output", True),

    # Layer 3: System Evaluation
    Metric("e2e_latency_p95", EvaluationLayer.SYSTEM,
           "95th percentile end-to-end latency", False),
    Metric("cost_per_task", EvaluationLayer.SYSTEM,
           "Average cost in tokens/dollars per task", False),
    Metric("safety_compliance", EvaluationLayer.SYSTEM,
           "Percentage of responses within guardrails", True),
    Metric("user_preference", EvaluationLayer.SYSTEM,
           "Human preference win rate vs baseline", True),
]

def get_metrics_by_layer(layer: EvaluationLayer) -> List[Metric]:
    """Get all metrics for a specific evaluation layer."""
    return [m for m in EVALUATION_TAXONOMY if m.layer == layer]
public enum EvaluationLayer
{
    Component,
    Task,
    System
}

public record Metric(
    string Name,
    EvaluationLayer Layer,
    string Description,
    bool HigherIsBetter = true
);

public static class EvaluationTaxonomy
{
    public static readonly List<Metric> Metrics = new()
    {
        // Layer 1: Component Evaluation
        new("ToolCallingAccuracy", EvaluationLayer.Component,
            "Percentage of correct tool selections"),
        new("ArgumentExtractionF1", EvaluationLayer.Component,
            "F1 score for parameter extraction"),
        new("ResponseFormatValidity", EvaluationLayer.Component,
            "Percentage of valid JSON/structured outputs"),
        new("ContextUtilization", EvaluationLayer.Component,
            "Relevant context retrieval rate"),

        // Layer 2: Task Evaluation
        new("TaskCompletionRate", EvaluationLayer.Task,
            "Percentage of successfully completed tasks"),
        new("StepEfficiency", EvaluationLayer.Task,
            "Actual steps / optimal steps ratio", false),
        new("ErrorRecoveryRate", EvaluationLayer.Task,
            "Successful recoveries from errors"),
        new("OutputCorrectness", EvaluationLayer.Task,
            "Factual accuracy of generated output"),

        // Layer 3: System Evaluation
        new("E2ELatencyP95", EvaluationLayer.System,
            "95th percentile end-to-end latency", false),
        new("CostPerTask", EvaluationLayer.System,
            "Average cost in tokens/dollars per task", false),
        new("SafetyCompliance", EvaluationLayer.System,
            "Percentage of responses within guardrails"),
        new("UserPreference", EvaluationLayer.System,
            "Human preference win rate vs baseline")
    };

    public static IEnumerable<Metric> GetByLayer(EvaluationLayer layer) =>
        Metrics.Where(m => m.Layer == layer);
}
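
System-layer numbers mostly come from aggregating run logs. The sketch below shows one way to derive P50/P95/P99 latency and average cost per task from logged runs; the RunRecord fields, the nearest-rank percentile method, and the per-token prices are assumptions for illustration.

# System-layer aggregation: latency percentiles and average cost per task
# from logged runs. RunRecord fields and per-token prices are assumed.
import math
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class RunRecord:
    latency_s: float       # end-to-end wall-clock time for one task
    input_tokens: int
    output_tokens: int

def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile (no interpolation)."""
    ordered = sorted(values)
    k = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[k]

def system_metrics(runs: List[RunRecord],
                   usd_per_1k_input: float = 0.003,    # assumed price
                   usd_per_1k_output: float = 0.015) -> Dict[str, float]:
    latencies = [r.latency_s for r in runs]
    costs = [
        r.input_tokens / 1000 * usd_per_1k_input
        + r.output_tokens / 1000 * usd_per_1k_output
        for r in runs
    ]
    return {
        "e2e_latency_p50": percentile(latencies, 50),
        "e2e_latency_p95": percentile(latencies, 95),
        "e2e_latency_p99": percentile(latencies, 99),
        "cost_per_task": sum(costs) / len(costs),
    }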

Agent-Specific Metrics (DeepEval)

DeepEval is an open-source evaluation framework with metrics specifically designed for agentic systems. Unlike traditional NLP metrics, these capture tool use, grounding, and multi-step reasoning.

Tool Correctness

Measures whether the agent called the correct tools with correct arguments. Compares actual tool calls to expected calls.

Faithfulness

Measures whether the agent's response is grounded in retrieved context. Detects hallucination and fabrication.

Answer Relevancy

Measures whether the response actually addresses the user's question. Detects tangential or off-topic responses.

Task Completion

Uses an LLM judge to determine if the agent achieved the stated goal. More nuanced than binary pass/fail.

DeepEval Agent Metrics
// DeepEval: Agent-Specific Metrics
// Framework for evaluating agentic behaviors

function evaluate_agent_response(response, context):
    results = {}

    // 1. Tool Call Correctness
    // Did the agent call the right tools with right arguments?
    results["tool_correctness"] = evaluate_tool_calls(
        actual_calls = response.tool_calls,
        expected_calls = context.ground_truth_tools
    )

    // 2. Faithfulness (Grounding)
    // Is the response grounded in retrieved context?
    results["faithfulness"] = evaluate_faithfulness(
        response = response.text,
        retrieved_context = context.retrieved_docs
    )

    // 3. Answer Relevancy
    // Does the answer address the original question?
    results["relevancy"] = evaluate_relevancy(
        question = context.original_query,
        answer = response.text
    )

    // 4. Task Completion
    // Did the agent complete the assigned task?
    results["task_completion"] = evaluate_task(
        task = context.task_definition,
        result = response.final_output,
        expected = context.expected_output
    )

    // 5. Trajectory Quality
    // Was the agent's reasoning path efficient?
    results["trajectory_quality"] = evaluate_trajectory(
        steps = response.reasoning_trace,
        optimal_path = context.optimal_steps
    )

    return results
from deepeval import evaluate
from deepeval.metrics import (
    ToolCorrectnessMetric,
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    TaskCompletionMetric,
)
from deepeval.test_case import LLMTestCase, ToolCall

# Define test case for agent evaluation
test_case = LLMTestCase(
    input="Find the weather in London and book a restaurant nearby",
    actual_output="The weather in London is 15°C with light rain. I found 'The Ivy' restaurant nearby and booked a table for 7pm.",
    expected_output="Weather retrieved and restaurant booked successfully",
    retrieval_context=[
        "London weather: 15°C, light rain, humidity 80%",
        "Nearby restaurants: The Ivy (0.3mi), Sketch (0.5mi)"
    ],
    tools_called=[
        ToolCall(name="get_weather", args={"city": "London"}),
        ToolCall(name="search_restaurants", args={"location": "London", "cuisine": "any"}),
        ToolCall(name="book_restaurant", args={"restaurant": "The Ivy", "time": "19:00"})
    ],
    expected_tools=[
        ToolCall(name="get_weather", args={"city": "London"}),
        ToolCall(name="search_restaurants", args={"location": "London"}),
        ToolCall(name="book_restaurant", args={})  # Args can vary
    ]
)

# Define metrics
metrics = [
    ToolCorrectnessMetric(
        threshold=0.8,
        include_args=True  # Also check arguments
    ),
    FaithfulnessMetric(
        threshold=0.7,
        model="gpt-4"  # Judge model
    ),
    AnswerRelevancyMetric(
        threshold=0.7,
        model="gpt-4"
    ),
    TaskCompletionMetric(
        threshold=0.8,
        model="gpt-4"
    )
]

# Run evaluation
results = evaluate([test_case], metrics)

# Access results
for metric_result in results:
    print(f"{metric_result.name}: {metric_result.score:.2f}")
    if metric_result.reason:
        print(f"  Reason: {metric_result.reason}")
// Agent evaluation in C# using a custom framework
public interface IAgentMetric
{
    string Name { get; }
    double Threshold { get; }
    Task<MetricResult> EvaluateAsync(AgentTestCase testCase);
}

public record MetricResult(
    string MetricName,
    double Score,
    bool Passed,
    string? Reason = null
);

public record AgentTestCase(
    string Input,
    string ActualOutput,
    string ExpectedOutput,
    List<string> RetrievalContext,
    List<ToolCall> ToolsCalled,
    List<ToolCall> ExpectedTools
);

public class ToolCorrectnessMetric : IAgentMetric
{
    public string Name => "ToolCorrectness";
    public double Threshold { get; init; } = 0.8;
    public bool IncludeArgs { get; init; } = true;

    public Task<MetricResult> EvaluateAsync(AgentTestCase testCase)
    {
        var expectedTools = testCase.ExpectedTools.Select(t => t.Name).ToHashSet();
        var actualTools = testCase.ToolsCalled.Select(t => t.Name).ToHashSet();

        // Calculate precision and recall
        var truePositives = expectedTools.Intersect(actualTools).Count();
        var precision = actualTools.Count > 0
            ? (double)truePositives / actualTools.Count : 0;
        var recall = expectedTools.Count > 0
            ? (double)truePositives / expectedTools.Count : 0;

        // F1 score
        var score = precision + recall > 0
            ? 2 * (precision * recall) / (precision + recall) : 0;

        return Task.FromResult(new MetricResult(
            Name, score, score >= Threshold,
            $"Precision: {precision:P0}, Recall: {recall:P0}"
        ));
    }
}

// Evaluate agent
public class AgentEvaluator
{
    private readonly List<IAgentMetric> _metrics;

    public AgentEvaluator(List<IAgentMetric> metrics)
    {
        _metrics = metrics;
    }

    public async Task<List<MetricResult>> EvaluateAsync(AgentTestCase testCase)
    {
        var results = new List<MetricResult>();
        foreach (var metric in _metrics)
        {
            results.Add(await metric.EvaluateAsync(testCase));
        }
        return results;
    }
}

Trajectory-Level Evaluation

Beyond final outputs, trajectory evaluation examines the agent's reasoning path. This is crucial for understanding why an agent succeeded or failed, and for detecting issues like reasoning loops or inefficient strategies.

Why Trajectory Evaluation?

An agent might reach the correct answer through a convoluted path, wasting tokens and time. Trajectory metrics reveal inefficiencies that final-answer metrics miss.

Trajectory Evaluation
// Trajectory-Level Evaluation
// Evaluate the full reasoning path, not just final output

function evaluate_trajectory(trajectory, task):
    metrics = {}

    // 1. Step Count Efficiency
    // Compare actual steps to known optimal
    optimal_steps = get_optimal_path(task)
    metrics["step_ratio"] = len(trajectory.steps) / len(optimal_steps)

    // 2. Action Diversity
    // Penalize repetitive actions (agent stuck in loop)
    unique_actions = set(s.action for s in trajectory.steps)
    metrics["action_diversity"] = len(unique_actions) / len(trajectory.steps)

    // 3. Progress Rate
    // How much progress per step toward goal?
    progress_scores = []
    for i, step in enumerate(trajectory.steps):
        progress = estimate_progress(step.state, task.goal)
        progress_scores.append(progress)
    metrics["progress_rate"] = average_improvement(progress_scores)

    // 4. Error Recovery
    // Did agent recover from mistakes?
    errors = [s for s in trajectory.steps if s.is_error]
    recoveries = count_successful_recoveries(errors, trajectory)
    metrics["recovery_rate"] = recoveries / max(len(errors), 1)

    // 5. Reasoning Quality (LLM-as-Judge)
    // Have an LLM evaluate reasoning coherence
    metrics["reasoning_quality"] = llm_judge_reasoning(
        trajectory.reasoning_traces,
        task.description
    )

    return metrics
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

@dataclass
class TrajectoryStep:
    state: Dict[str, Any]
    action: str
    observation: str
    reasoning: str
    is_error: bool = False

@dataclass
class AgentTrajectory:
    steps: List[TrajectoryStep]
    final_result: Any
    task_completed: bool

class TrajectoryEvaluator:
    def __init__(self, llm_judge=None):
        self.llm_judge = llm_judge

    def evaluate(self,
                 trajectory: AgentTrajectory,
                 optimal_steps: Optional[int] = None) -> Dict[str, float]:
        metrics = {}

        # 1. Step efficiency
        if optimal_steps:
            metrics["step_efficiency"] = min(
                1.0, optimal_steps / len(trajectory.steps)
            )

        # 2. Action diversity (detect loops)
        actions = [s.action for s in trajectory.steps]
        metrics["action_diversity"] = len(set(actions)) / len(actions)

        # 3. Detect repetition (exact action sequences)
        metrics["repetition_score"] = self._detect_repetition(actions)

        # 4. Error recovery rate
        errors = [s for s in trajectory.steps if s.is_error]
        if errors:
            recoveries = self._count_recoveries(trajectory, errors)
            metrics["recovery_rate"] = recoveries / len(errors)
        else:
            metrics["recovery_rate"] = 1.0  # No errors = perfect

        # 5. LLM-as-judge for reasoning quality
        if self.llm_judge:
            metrics["reasoning_quality"] = self._judge_reasoning(
                trajectory
            )

        return metrics

    def _detect_repetition(self, actions: List[str]) -> float:
        """Return 1.0 if no repetition, lower if patterns repeat."""
        if len(actions) < 4:
            return 1.0

        # Check for repeating patterns of length 2-3
        for pattern_len in [2, 3]:
            for i in range(len(actions) - pattern_len * 2):
                pattern = actions[i:i + pattern_len]
                next_seq = actions[i + pattern_len:i + pattern_len * 2]
                if pattern == next_seq:
                    return 0.5  # Repetition detected
        return 1.0

    def _count_recoveries(self,
                          trajectory: AgentTrajectory,
                          errors: List[TrajectoryStep]) -> int:
        """Count how many errors led to successful recovery."""
        recoveries = 0
        for error in errors:
            error_idx = trajectory.steps.index(error)
            # Check if subsequent steps show different approach
            if error_idx + 1 < len(trajectory.steps):
                next_step = trajectory.steps[error_idx + 1]
                if next_step.action != error.action:
                    recoveries += 1
        return recoveries

    def _judge_reasoning(self, trajectory: AgentTrajectory) -> float:
        """Use LLM to judge reasoning quality."""
        reasoning_trace = "\n".join(
            f"Step {i}: {s.reasoning}"
            for i, s in enumerate(trajectory.steps)
        )

        prompt = f"""Evaluate the quality of this agent's reasoning trace.

Reasoning trace:
{reasoning_trace}

Rate from 0-1 based on:
- Logical coherence
- Goal-directed behavior
- Appropriate use of observations
- Clear decision making

Return only a number between 0 and 1."""

        response = self.llm_judge.generate(prompt)
        return float(response.strip())
public record TrajectoryStep(
    Dictionary<string, object> State,
    string Action,
    string Observation,
    string Reasoning,
    bool IsError = false
);

public record AgentTrajectory(
    List<TrajectoryStep> Steps,
    object FinalResult,
    bool TaskCompleted
);

public class TrajectoryEvaluator
{
    private readonly ILLMJudge? _llmJudge;

    public TrajectoryEvaluator(ILLMJudge? llmJudge = null)
    {
        _llmJudge = llmJudge;
    }

    public async Task<Dictionary<string, double>> EvaluateAsync(
        AgentTrajectory trajectory,
        int? optimalSteps = null)
    {
        var metrics = new Dictionary<string, double>();

        // 1. Step efficiency
        if (optimalSteps.HasValue)
        {
            metrics["step_efficiency"] = Math.Min(
                1.0, (double)optimalSteps.Value / trajectory.Steps.Count
            );
        }

        // 2. Action diversity
        var actions = trajectory.Steps.Select(s => s.Action).ToList();
        metrics["action_diversity"] =
            (double)actions.Distinct().Count() / actions.Count;

        // 3. Repetition detection
        metrics["repetition_score"] = DetectRepetition(actions);

        // 4. Error recovery
        var errors = trajectory.Steps.Where(s => s.IsError).ToList();
        if (errors.Any())
        {
            var recoveries = CountRecoveries(trajectory, errors);
            metrics["recovery_rate"] = (double)recoveries / errors.Count;
        }
        else
        {
            metrics["recovery_rate"] = 1.0;
        }

        // 5. LLM judge
        if (_llmJudge != null)
        {
            metrics["reasoning_quality"] =
                await JudgeReasoningAsync(trajectory);
        }

        return metrics;
    }

    private double DetectRepetition(List<string> actions)
    {
        if (actions.Count < 4) return 1.0;

        for (int patternLen = 2; patternLen <= 3; patternLen++)
        {
            for (int i = 0; i < actions.Count - patternLen * 2; i++)
            {
                var pattern = actions.Skip(i).Take(patternLen).ToList();
                var next = actions.Skip(i + patternLen).Take(patternLen).ToList();
                if (pattern.SequenceEqual(next)) return 0.5;
            }
        }
        return 1.0;
    }

    private int CountRecoveries(
        AgentTrajectory trajectory,
        List<TrajectoryStep> errors)
    {
        int recoveries = 0;
        foreach (var error in errors)
        {
            int idx = trajectory.Steps.IndexOf(error);
            if (idx + 1 < trajectory.Steps.Count)
            {
                var next = trajectory.Steps[idx + 1];
                if (next.Action != error.Action) recoveries++;
            }
        }
        return recoveries;
    }

    private async Task<double> JudgeReasoningAsync(AgentTrajectory trajectory)
    {
        var trace = string.Join("\n", trajectory.Steps.Select(
            (s, i) => $"Step {i}: {s.Reasoning}"
        ));

        var result = await _llmJudge!.JudgeAsync(trace);
        return result;
    }
}

Major Agent Benchmarks

Benchmarks provide standardized tasks for comparing agent capabilities. Each focuses on different aspects of agentic behavior.

Benchmark          | Domain               | Tasks | Primary Metric    | Top Score (2025)
-------------------+----------------------+-------+-------------------+-------------------------
SWE-bench Verified | Software Engineering |   500 | Resolved Rate     | ~72% (Claude 3.5 Sonnet)
SWE-bench Full     | Software Engineering | 2,294 | Resolved Rate     | ~51%
WebArena           | Web Navigation       |   812 | Task Success Rate | ~42%
GAIA Level 1       | General Assistant    |  ~165 | Exact Match       | ~75%
GAIA Level 2       | General Assistant    |  ~186 | Exact Match       | ~60%
GAIA Level 3       | General Assistant    |  ~115 | Exact Match       | ~40%
τ-bench            | Tool + Conversation  |   680 | Pass Rate         | ~45% (Retail)
AgentBench         | Multi-Environment    | 1,632 | Overall Score     | Model-dependent
HumanEval          | Code Generation      |   164 | Pass@1            | ~95%

Major agent benchmarks comparison

Benchmark Details

SWE-bench: Software Engineering

SWE-bench evaluates agents on real GitHub issues from popular Python repositories (Django, Flask, scikit-learn, etc.). The agent must understand the issue, locate relevant code, and generate a patch that passes the repository's test suite.

Code Understanding · Code Generation · Test-Driven

Variants

SWE-bench Verified (500 tasks) uses human-verified ground truth. SWE-bench Lite (300 tasks) contains simpler issues for faster iteration. SWE-bench Full (2,294 tasks) is the complete dataset.

WebArena: Web Navigation

WebArena tests agents on realistic web navigation tasks across 5 self-hosted websites (shopping, forums, code hosting, maps, wiki). Tasks include filling forms, navigating menus, and extracting information.

Visual Understanding · Action Planning · Real Websites
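
WebArena ships its own evaluation harness and site fixtures; the sketch below is only a schematic of the core idea, namely that success is judged from the final environment state rather than the agent's text. The task and state structures are hypothetical.

# Schematic of WebArena-style evaluation: success is judged from the final
# environment state, not the agent's text. Structures here are hypothetical;
# the real benchmark ships its own evaluators and self-hosted sites.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class WebTask:
    instruction: str                      # natural-language goal
    check_state: Callable[[Dict], bool]   # predicate over the final page state

def evaluate_web_task(task: WebTask, final_state: Dict) -> bool:
    """The task passes only if the final state satisfies the check."""
    return task.check_state(final_state)

# Example: "add a blue t-shirt to the cart" passes if the cart contains it
task = WebTask(
    instruction="Add a blue t-shirt to the shopping cart",
    check_state=lambda s: any(
        item.get("name") == "blue t-shirt" for item in s.get("cart", [])
    ),
)
print(evaluate_web_task(task, {"cart": [{"name": "blue t-shirt", "qty": 1}]}))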

τ-bench (Tau-bench): Tool + Conversation

τ-bench evaluates agents on multi-turn conversations that require tool use. Unlike single-turn benchmarks, it tests the agent's ability to maintain context across turns while calling appropriate tools.

Multi-Turn · Tool Use · Conversation

Domains

τ-bench covers three domains: Airline (booking, changes, cancellations), Retail (orders, returns, product search), and Banking (transactions, account management). Each domain has simulated APIs.
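
A τ-bench-style check grades the outcome of the whole conversation: the simulated API's final state must match the goal state, and the required tool calls must have happened. The sketch below is a schematic using hypothetical structures, not the benchmark's actual harness.

# Schematic of a tau-bench-style check: after the multi-turn conversation,
# compare the simulated API's final state to the goal state and verify the
# required tool calls happened. All structures here are hypothetical.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ConversationRun:
    tool_calls: List[str]              # names of tools the agent invoked
    final_db_state: Dict[str, object]  # simulated API state after the run

@dataclass
class TauStyleTask:
    required_tools: List[str]
    goal_state: Dict[str, object]      # keys that must match at the end

def passes(task: TauStyleTask, run: ConversationRun) -> bool:
    tools_ok = all(t in run.tool_calls for t in task.required_tools)
    state_ok = all(run.final_db_state.get(k) == v
                   for k, v in task.goal_state.items())
    return tools_ok and state_ok

# Example: a retail return must call both tools and mark the order returned
task = TauStyleTask(
    required_tools=["lookup_order", "process_return"],
    goal_state={"order_1234_status": "returned"},
)
run = ConversationRun(
    tool_calls=["lookup_order", "process_return"],
    final_db_state={"order_1234_status": "returned"},
)
print(passes(task, run))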

GAIA: General AI Assistant

GAIA tests general-purpose assistant capabilities with questions requiring web search, file processing, calculation, and multi-step reasoning. Questions are conceptually simple for humans yet require genuinely agentic capabilities from AI systems.

Web Search · Multi-Step Reasoning · File Processing

Running Benchmark Evaluations
// Major Agent Benchmarks Overview

// SWE-bench: Software Engineering
// - 2,294 real GitHub issues from popular Python repos
// - Agent must: understand issue, locate code, implement fix
// - Evaluated by running repository's test suite
SWEBench:
    input: GitHub issue description + repository snapshot
    output: Git patch that fixes the issue
    evaluation: run_tests(patched_repo) → pass/fail
    variants:
        - SWE-bench Full: all 2,294 issues
        - SWE-bench Verified: 500 human-verified subset
        - SWE-bench Lite: 300 simpler issues

// WebArena: Web Navigation
// - 812 tasks across 5 real websites
// - Agent must navigate, fill forms, extract info
// - Evaluated by checking final page state
WebArena:
    input: Natural language instruction
    output: Sequence of browser actions
    evaluation: check_page_state(final_page) → match/no_match
    websites: shopping, reddit, gitlab, maps, wikipedia

// τ-bench (Tau-bench): Tool Use + Conversation
// - 680 multi-turn conversations requiring tool use
// - Tests realistic agentic assistance scenarios
// - Evaluates both tool use and conversational ability
TauBench:
    input: Multi-turn user conversation
    output: Agent responses + tool calls
    evaluation:
        - Tool call correctness
        - Response helpfulness
        - Goal achievement
    domains: airline, retail, banking (simulated APIs)

// GAIA: General AI Assistant
// - 466 questions requiring multi-step reasoning
// - Three difficulty levels
// - Requires web search, calculation, reasoning
GAIA:
    input: Question (may include files/images)
    output: Final answer (short text)
    evaluation: exact_match(answer, ground_truth)
    levels:
        - Level 1: 1-2 steps
        - Level 2: 3-5 steps
        - Level 3: 6+ steps
# Running SWE-bench evaluation
from swebench.harness.run_evaluation import run_evaluation
from swebench.harness.constants import KEY_INSTANCE_ID, KEY_MODEL, KEY_PREDICTION

# Prepare predictions (your agent's patches)
predictions = [
    {
        KEY_INSTANCE_ID: "django__django-11039",
        KEY_MODEL: "my_agent_v1",
        KEY_PREDICTION: '''
--- a/django/db/models/sql/query.py
+++ b/django/db/models/sql/query.py
@@ -1234,6 +1234,8 @@ class Query:
     def add_ordering(self, *ordering):
+        if not ordering:
+            return
         errors = []
'''
    }
]

# Run evaluation
results = run_evaluation(
    predictions=predictions,
    swe_bench_tasks="princeton-nlp/SWE-bench_Verified",
    log_dir="./logs",
    timeout=1800  # 30 min per instance
)

# Results structure
for result in results:
    print(f"Instance: {result['instance_id']}")
    print(f"  Resolved: {result['resolved']}")
    print(f"  Tests passed: {result['tests_passed']}")


# Running GAIA evaluation
from datasets import load_dataset

# Load GAIA dataset
gaia = load_dataset("gaia-benchmark/GAIA", "2023_all")

def evaluate_gaia_response(prediction: str, ground_truth: str) -> bool:
    """GAIA uses relaxed exact match."""
    # Normalize both strings
    pred = prediction.lower().strip()
    truth = ground_truth.lower().strip()

    # Remove common suffixes/prefixes
    pred = pred.rstrip('.')
    truth = truth.rstrip('.')

    return pred == truth

# Evaluate your agent (per-level accuracy)
for level in [1, 2, 3]:
    level_data = gaia['test'].filter(lambda x: x['level'] == level)
    correct = 0
    total = 0
    for item in level_data:
        agent_answer = my_agent.answer(item['question'], item.get('file'))
        if evaluate_gaia_response(agent_answer, item['ground_truth']):
            correct += 1
        total += 1
    print(f"Level {level}: {correct}/{total} = {correct/total:.1%}")
// Benchmark evaluation framework in C#
public interface IBenchmark
{
    string Name { get; }
    Task<BenchmarkResult> EvaluateAsync(IAgent agent);
}

public record BenchmarkResult(
    string BenchmarkName,
    int TotalTasks,
    int Passed,
    double Accuracy,
    Dictionary<string, double> CategoryScores,
    TimeSpan TotalTime
);

public class SWEBenchEvaluator : IBenchmark
{
    public string Name => "SWE-bench";
    private readonly List<SWEBenchTask> _tasks;
    private readonly IDockerRunner _docker;

    public SWEBenchEvaluator(List<SWEBenchTask> tasks, IDockerRunner docker)
    {
        _tasks = tasks;
        _docker = docker;
    }

    public async Task<BenchmarkResult> EvaluateAsync(IAgent agent)
    {
        var sw = Stopwatch.StartNew();
        var results = new List<TaskResult>();
        var categoryScores = new Dictionary<string, List<bool>>();

        foreach (var task in _tasks)
        {
            // Agent generates patch
            var patch = await agent.GeneratePatchAsync(
                task.IssueDescription,
                task.RepoSnapshot
            );

            // Apply patch and run tests in Docker
            var testResult = await _docker.RunTestsAsync(
                task.RepoSnapshot,
                patch,
                task.TestCommand,
                timeoutSeconds: 1800
            );

            var passed = testResult.AllTestsPassed;
            results.Add(new TaskResult(task.InstanceId, passed));

            // Track by category
            if (!categoryScores.ContainsKey(task.Category))
                categoryScores[task.Category] = new List<bool>();
            categoryScores[task.Category].Add(passed);
        }

        sw.Stop();
        return new BenchmarkResult(
            Name,
            _tasks.Count,
            results.Count(r => r.Passed),
            (double)results.Count(r => r.Passed) / _tasks.Count,
            categoryScores.ToDictionary(
                kv => kv.Key,
                kv => (double)kv.Value.Count(p => p) / kv.Value.Count
            ),
            sw.Elapsed
        );
    }
}

public class GAIAEvaluator : IBenchmark
{
    public string Name => "GAIA";

    public async Task<BenchmarkResult> EvaluateAsync(IAgent agent)
    {
        var levelScores = new Dictionary<int, (int passed, int total)>
        {
            [1] = (0, 0), [2] = (0, 0), [3] = (0, 0)
        };

        foreach (var task in LoadGAIATasks())
        {
            var answer = await agent.AnswerAsync(task.Question, task.File);
            var passed = EvaluateAnswer(answer, task.GroundTruth);

            var (p, t) = levelScores[task.Level];
            levelScores[task.Level] = (p + (passed ? 1 : 0), t + 1);
        }

        var totalPassed = levelScores.Values.Sum(x => x.passed);
        var totalTasks = levelScores.Values.Sum(x => x.total);

        return new BenchmarkResult(
            Name, totalTasks, totalPassed,
            (double)totalPassed / totalTasks,
            levelScores.ToDictionary(
                kv => $"Level {kv.Key}",
                kv => (double)kv.Value.passed / kv.Value.total
            ),
            TimeSpan.Zero
        );
    }

    private bool EvaluateAnswer(string prediction, string groundTruth)
    {
        // GAIA relaxed exact match
        return prediction.Trim().ToLower().TrimEnd('.')
            == groundTruth.Trim().ToLower().TrimEnd('.');
    }
}

LLM-as-Judge

Many agent behaviors are hard to evaluate with deterministic metrics. LLM-as-Judge uses a capable model to evaluate agent outputs, providing more nuanced assessment.

┌──────────────────────────────────────────────────────────────────┐
│                      LLM-as-Judge Pipeline                        │
├──────────────────────────────────────────────────────────────────┤
│                                                                   │
│   ┌─────────────┐    ┌──────────────┐    ┌─────────────────┐     │
│   │ Agent       │    │ Judge        │    │ Structured      │     │
│   │ Output      │───▶│ Prompt       │───▶│ Evaluation      │     │
│   │             │    │ + Rubric     │    │                 │     │
│   └─────────────┘    └──────────────┘    │ • Score: 0-1    │     │
│                                          │ • Reasoning     │     │
│   ┌─────────────┐                        │ • Suggestions   │     │
│   │ Ground      │                        └─────────────────┘     │
│   │ Truth       │────────▶ Compare                               │
│   │ (optional)  │                                                │
│   └─────────────┘                                                │
│                                                                   │
│   Judge Models: GPT-4, Claude 3 Opus, Gemini 1.5 Pro             │
└──────────────────────────────────────────────────────────────────┘

Judge Bias

LLM judges have known biases: they prefer verbose responses, may favor their own writing style, and can be fooled by confident-sounding but incorrect answers. Use multiple judges and calibrate against human labels.
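
One way to operationalize the pipeline above while softening single-judge bias is to send the same rubric to several judge models, aggregate their scores, and calibrate the result against a human-labeled subset. The sketch below assumes a hypothetical call_judge(model, prompt) helper that returns the judge's raw text; wire it to whichever judge models you actually use.

# LLM-as-judge with a shared rubric and several judge models to reduce
# single-model bias. call_judge(model, prompt) is a hypothetical helper
# returning the judge's raw text; calibrate scores against human labels.
from statistics import median
from typing import Callable, Dict, List

RUBRIC = """Evaluate the agent output below.
Question: {question}
Agent output: {output}
Reference answer (may be empty): {reference}

Consider correctness, groundedness in the reference, and completeness.
Return only a single number between 0 and 1."""

def judge_output(question: str,
                 output: str,
                 reference: str,
                 judge_models: List[str],
                 call_judge: Callable[[str, str], str]) -> Dict:
    prompt = RUBRIC.format(question=question, output=output, reference=reference)
    per_judge = {}
    for model in judge_models:
        raw = call_judge(model, prompt)
        try:
            per_judge[model] = min(1.0, max(0.0, float(raw.strip())))
        except ValueError:
            continue  # skip judges that do not return a bare number
    scores = list(per_judge.values())
    return {
        "median_score": median(scores) if scores else None,
        "per_judge": per_judge,
    }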

Evaluation Best Practices

Do

  • Evaluate at multiple layers (component, task, system)
  • Use held-out test sets separate from development
  • Include adversarial/edge cases in test suite
  • Track metrics over time to detect regression
  • Combine automated metrics with human evaluation
  • Report confidence intervals, not just point estimates (see the sketch after these lists)

Don't

  • Rely only on final answer accuracy
  • Overfit to benchmark-specific patterns
  • Ignore trajectory quality (how the answer was reached)
  • Skip safety/guardrail evaluation
  • Use only synthetic test cases
  • Assume benchmark scores reflect production performance
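
For the "report confidence intervals" item above, a percentile bootstrap over per-task pass/fail outcomes is usually enough. The sketch below is a minimal version, not a full statistical treatment; the resample count and seed are arbitrary.

# Percentile-bootstrap confidence interval for a pass/fail metric such as
# task completion rate. Minimal sketch with arbitrary resample count/seed.
import random
from typing import List, Tuple

def bootstrap_ci(outcomes: List[bool],
                 n_resamples: int = 10_000,
                 confidence: float = 0.95,
                 seed: int = 0) -> Tuple[float, float]:
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((1 - confidence) / 2 * n_resamples)]
    hi = means[int((1 + confidence) / 2 * n_resamples) - 1]
    return lo, hi

# Example: 62 of 100 tasks passed
results = [True] * 62 + [False] * 38
low, high = bootstrap_ci(results)
print(f"Completion rate: 62.0% (95% CI {low:.1%}-{high:.1%})")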
