Evaluation

Evaluating agent performance, reliability, and safety is key to building production-grade agents.

SDK Focus: fallbackModel, setModel(), supportedModels(), error codes, accountInfo()

Core Metrics

Metric        Description                       Measurement
Success rate  Task completion ratio             Successful tasks / total tasks
Efficiency    Steps required to complete tasks  Tool calls, token usage
Accuracy      Output quality                    Test pass rate, human ratings
Safety        Rate of unsafe behavior           Policy violations count
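
As a minimal sketch of how these metrics could be computed from logged runs, the snippet below aggregates a list of trial records. The `TrialResult` shape and its field names are assumptions made for illustration, not part of any SDK; accuracy is left out because it depends on test results or human ratings collected separately.

```python
from dataclasses import dataclass


@dataclass
class TrialResult:
    """Outcome of one agent run (hypothetical record shape)."""
    success: bool
    tool_calls: int
    tokens: int
    policy_violations: int


def summarize(trials):
    """Aggregate success rate, efficiency, and safety over a batch of trials."""
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    return {
        "success_rate": sum(t.success for t in trials) / n,
        "avg_tool_calls": sum(t.tool_calls for t in trials) / n,
        "avg_tokens": sum(t.tokens for t in trials) / n,
        "policy_violations": sum(t.policy_violations for t in trials),
    }


# Made-up sample data, only to show the shape of the report.
trials = [
    TrialResult(True, 4, 2000, 0),
    TrialResult(False, 9, 5000, 1),
    TrialResult(True, 5, 2600, 0),
    TrialResult(True, 6, 2400, 0),
]
report = summarize(trials)
print(report)
```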

Common Benchmarks

  • SWE-bench - Software engineering tasks (fix GitHub issues)
  • HumanEval - Code generation capability
  • WebArena - Web interaction capability
  • GAIA - General AI assistant capability
  • AgentBench - Multi-scenario agent capability

Testing Strategy

Unit Tests

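# file_read_tool and search_tool are the tools under test, imported from your own project.

def test_file_read_tool():
    # Assumes a fixture file test.txt that contains "expected content"
    result = file_read_tool("test.txt")
    assert result.success
    assert "expected content" in result.content

def test_search_tool():
    # Every search should return at least one hit that mentions the query term
    results = search_tool("function login")
    assert len(results) > 0
    assert any("login" in r.content for r in results)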

Integration Tests

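def test_complete_workflow():
    # Agent, read_file, and run_tests come from the project under test
    agent = Agent(tools=[...])
    result = agent.run("Create a hello world function")

    # Check the side effects of the run, not just the agent's final message
    assert result.files_created == ["hello.py"]
    assert "def hello" in read_file("hello.py")
    assert run_tests().passed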

Eval-Driven Development

# eval_suite.yaml
- name: "Basic file operations"
  tasks:
    - input: "Read config.json"
      expected: tool_call("read_file", "config.json")
    - input: "Create a new file test.py"
      expected: file_exists("test.py")

- name: "Error handling"
  tasks:
    - input: "Read nonexistent.txt"
      expected: graceful_error_message()
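
One way to execute a suite like this is to turn each `expected` entry into a check function and run it against a record of what the agent did. The sketch below is an assumption-heavy illustration: `AgentRun`, `stub_agent`, and `run_suite` are hypothetical names, the suite is written inline rather than parsed from YAML, and the stub stands in for a real agent so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AgentRun:
    """What the harness records about one run (hypothetical shape)."""
    tool_calls: list = field(default_factory=list)  # (tool name, argument) pairs
    files: set = field(default_factory=set)         # files present after the run
    error_message: Optional[str] = None


Check = Callable[[AgentRun], bool]


def tool_call(name: str, arg: str) -> Check:
    return lambda run: (name, arg) in run.tool_calls


def file_exists(path: str) -> Check:
    return lambda run: path in run.files


def graceful_error_message() -> Check:
    return lambda run: bool(run.error_message)


# Same tasks as the YAML suite, expressed as (input, check) pairs.
SUITE = [
    ("Read config.json", tool_call("read_file", "config.json")),
    ("Create a new file test.py", file_exists("test.py")),
    ("Read nonexistent.txt", graceful_error_message()),
]


def stub_agent(prompt: str) -> AgentRun:
    """Stand-in for a real agent; replace with a call into your own agent."""
    if prompt == "Read config.json":
        return AgentRun(tool_calls=[("read_file", "config.json")])
    if prompt.startswith("Create a new file"):
        return AgentRun(files={"test.py"})
    return AgentRun(error_message="File not found: nonexistent.txt")


def run_suite(agent, suite):
    """Run every task and map each input to pass (True) or fail (False)."""
    return {prompt: check(agent(prompt)) for prompt, check in suite}


results = run_suite(stub_agent, SUITE)
print(results)
```

Keeping checks as small functions makes it cheap to add a new expectation type without touching the runner.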

Observability

# Full execution trace
{
  "session_id": "abc123",
  "task": "Fix the login bug",
  "steps": [
    {"action": "read_file", "args": {"path": "auth.py"}, "duration_ms": 45},
    {"action": "think", "content": "Found the issue...", "tokens": 150},
    {"action": "edit_file", "args": {...}, "duration_ms": 120}
  ],
  "total_tokens": 2500,
  "total_duration_ms": 8500,
  "success": true
}
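
A trace in this shape can be summarized with a few lines of Python. The sketch below assumes the field names shown in the trace above; the `edit_file` arguments are elided in the source, so that step is reproduced here without `args`. `summarize_trace` is a hypothetical helper name.

```python
from collections import defaultdict

# Trace copied from the example above; the edit_file args are elided in the source.
trace = {
    "session_id": "abc123",
    "task": "Fix the login bug",
    "steps": [
        {"action": "read_file", "args": {"path": "auth.py"}, "duration_ms": 45},
        {"action": "think", "content": "Found the issue...", "tokens": 150},
        {"action": "edit_file", "duration_ms": 120},
    ],
    "total_tokens": 2500,
    "total_duration_ms": 8500,
    "success": True,
}


def summarize_trace(trace):
    """Break a trace down into time per action and tokens recorded on steps."""
    duration_by_action = defaultdict(int)
    step_tokens = 0
    for step in trace["steps"]:
        # Steps without a duration or token count contribute zero.
        duration_by_action[step["action"]] += step.get("duration_ms", 0)
        step_tokens += step.get("tokens", 0)
    step_ms = sum(duration_by_action.values())
    return {
        "step_ms": step_ms,
        "untraced_ms": trace["total_duration_ms"] - step_ms,
        "step_tokens": step_tokens,
        "slowest_action": max(duration_by_action, key=duration_by_action.get),
    }


summary = summarize_trace(trace)
print(summary)
```

A large `untraced_ms` value suggests that most of the wall-clock time is spent outside the recorded steps, for example in model latency, which is a hint about where to add instrumentation.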

Model Routing & Fallback

// model-routing.ts
import { query } from "@anthropic-ai/claude-agent-sdk";

// Auto-fallback: if primary fails, use backup
const response = query({
  prompt: "Generate a complex migration plan",
  options: {
    model: "claude-opus-4-5",
    fallbackModel: "claude-sonnet-4-5"  // Auto-fallback
  }
});

// Dynamic model switching mid-session
const q = query({ prompt: "Analyze this code" });
const models = await q.supportedModels(); // List available
await q.setModel("claude-opus-4-5");     // Escalate for hard tasks

// Cost-aware routing
const account = await q.accountInfo();
console.log(`Budget remaining: ${account.credits}`);

// Error code handling
for await (const msg of response) {
  if (msg.type === "error") {
    switch (msg.error_code) {
      case "RATE_LIMIT_EXCEEDED":
        // Retry with backoff
        break;
      case "CONTEXT_LENGTH_EXCEEDED":
        // Rotate session
        break;
      case "AUTHENTICATION_FAILED":
        // Check API key
        break;
      default:
        // Unrecognized error code: log it and surface the failure
        break;
    }
  }
}
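
The "retry with backoff" branch can be sketched independently of any SDK. The Python example below is a generic illustration of exponential backoff with full jitter; `RateLimitError` and `retry_with_backoff` are hypothetical names, and the `sleep` parameter exists only so the delay can be stubbed out in tests.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical exception standing in for a rate-limit error code."""


def retry_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call `call()` until it succeeds, backing off exponentially on rate limits."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # full jitter spreads out retries


# Demo: a call that is rate limited twice, then succeeds.
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError("RATE_LIMIT_EXCEEDED")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = retry_with_backoff(flaky, sleep=delays.append)
print(result, attempts["n"], len(delays))
```

Only the rate-limit case is retried here; authentication failures and context-length errors will not succeed on a plain retry, so they are better handled by the other branches.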

Safety Evaluation

  • Privilege escalation tests - Attempt access to restricted resources
  • Injection attack tests - Handle malicious inputs
  • Resource abuse tests - Infinite loops, excessive API calls
  • Data leakage tests - Handling of sensitive information
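
As an illustration of a privilege escalation test, the sketch below checks that a file-reading tool cannot escape its workspace through path traversal. `safe_read` and `AccessDenied` are hypothetical stand-ins for whatever sandboxing your tool layer provides; `Path.is_relative_to` requires Python 3.9 or later.

```python
import tempfile
from pathlib import Path


class AccessDenied(Exception):
    """Raised when a requested path falls outside the workspace."""


def safe_read(workspace: Path, requested: str) -> str:
    """Read a file only if it resolves to a location inside the workspace."""
    root = workspace.resolve()
    target = (root / requested).resolve()  # collapses ".." and symlinks
    if not target.is_relative_to(root):
        raise AccessDenied(requested)
    return target.read_text()


def test_traversal_is_blocked():
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "workspace"
        workspace.mkdir()
        (workspace / "notes.txt").write_text("public")
        secret = Path(tmp) / "secret.txt"
        secret.write_text("top secret")

        # Legitimate access still works.
        assert safe_read(workspace, "notes.txt") == "public"

        # Relative and absolute escape attempts must all be rejected.
        for attack in ("../secret.txt", "sub/../../secret.txt", str(secret)):
            try:
                safe_read(workspace, attack)
            except AccessDenied:
                continue
            raise AssertionError(f"escape not blocked: {attack}")


test_traversal_is_blocked()
```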

Best Practices

  • Build a continuous evaluation pipeline (CI/CD integration)
  • Collect real user feedback
  • A/B test different prompts and configurations
  • Run regular regression tests
  • Monitor production performance
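
A continuous evaluation pipeline usually ends in a gate that fails the build when quality drops. The sketch below shows one possible regression check that compares per-suite success rates against a stored baseline; the suite names, the rates, and the 0.02 tolerance are made-up values for illustration.

```python
def regression_gate(baseline, current, tolerance=0.02):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for suite, base_rate in baseline.items():
        rate = current.get(suite)
        if rate is None:
            failures.append(f"{suite}: missing from current run")
        elif rate < base_rate - tolerance:
            failures.append(f"{suite}: {rate:.2f} < baseline {base_rate:.2f}")
    return failures


# Made-up success rates per eval suite.
baseline = {"file_ops": 0.95, "error_handling": 0.90}
current = {"file_ops": 0.96, "error_handling": 0.80}

failures = regression_gate(baseline, current)
for line in failures:
    print("REGRESSION:", line)
# In CI, a non-empty list would typically translate into a non-zero exit code.
```

The tolerance absorbs run-to-run noise from model sampling; with small suites it may need to be wider, or the suite may need more trials per task.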

Conclusion

Congratulations on completing the agent development fundamentals! As a next step, try the exercise below.

Try It: Cost-Aware Model Router

Build a router that picks the optimal model based on task complexity.

  1. Use supportedModels() to list available models
  2. Classify tasks as simple/medium/complex
  3. Route: Haiku for simple, Sonnet for medium, Opus for complex
  4. Run 20 trials, compute success rate and average token cost per route

Gate: P5 Complete. 20-trial eval report generated, model router picks optimal model, cost tracked per route.
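
A possible starting point for the exercise, sketched in Python with no SDK calls: a toy classifier, a route table, and a per-route report. The keyword heuristic, the word-count threshold, and the tier labels `haiku`, `sonnet`, and `opus` are assumptions for illustration; in a real router the labels would be mapped to model IDs returned by `supportedModels()`.

```python
from collections import defaultdict
from dataclasses import dataclass

# Tier labels are placeholders, not real model IDs.
ROUTES = {"simple": "haiku", "medium": "sonnet", "complex": "opus"}
HARD_MARKERS = ("migrate", "refactor", "architecture", "design")


def classify(task: str) -> str:
    """Toy heuristic: keywords mark complex tasks, length splits the rest."""
    if any(marker in task.lower() for marker in HARD_MARKERS):
        return "complex"
    return "simple" if len(task.split()) <= 6 else "medium"


def route(task: str) -> str:
    return ROUTES[classify(task)]


@dataclass
class Trial:
    model: str
    success: bool
    tokens: int


def report(trials):
    """Success rate and average token cost per routed model."""
    grouped = defaultdict(list)
    for t in trials:
        grouped[t.model].append(t)
    return {
        model: {
            "trials": len(ts),
            "success_rate": sum(t.success for t in ts) / len(ts),
            "avg_tokens": sum(t.tokens for t in ts) / len(ts),
        }
        for model, ts in grouped.items()
    }


print(route("Rename a variable"))
print(route("Add input validation to the signup form handler"))
print(route("Design a database migration plan for the billing service"))

# Made-up trial data; the exercise asks for 20 real trials.
sample = [Trial("haiku", True, 400), Trial("haiku", False, 600), Trial("opus", True, 3000)]
stats = report(sample)
print(stats)
```

Replacing `classify` with a cheap model call is a natural next step once the heuristic's misroutes show up in the per-route success rates.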