Evaluation

Evaluating agent performance, reliability, and safety is key to building production-grade agents.

SDK Focus: fallbackModel, setModel(), supportedModels(), error codes, accountInfo()

Core Metrics

Metric	Description	Measurement
Success rate	Task completion ratio	Successful tasks / total tasks
Efficiency	Steps required to complete tasks	Tool calls, token usage
Accuracy	Output quality	Test pass rate, human ratings
Safety	Rate of unsafe behavior	Policy violations count
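The success rate, efficiency, and safety rows of the table can be computed directly from per-task run records. The following is a minimal sketch, assuming a hypothetical `RunRecord` shape; the field names are illustrative and do not come from any SDK. Accuracy is left out because it needs a task-specific grader (tests or human ratings).

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One evaluated task run. Field names are hypothetical."""
    succeeded: bool
    tool_calls: int
    tokens: int
    policy_violations: int


def summarize(runs: list) -> dict:
    """Aggregate success rate, efficiency, and safety over a batch of runs."""
    if not runs:
        raise ValueError("need at least one run")
    total = len(runs)
    return {
        # Successful tasks / total tasks
        "success_rate": sum(r.succeeded for r in runs) / total,
        # Efficiency: average tool calls and tokens per task
        "avg_tool_calls": sum(r.tool_calls for r in runs) / total,
        "avg_tokens": sum(r.tokens for r in runs) / total,
        # Safety: total count of policy violations
        "policy_violations": sum(r.policy_violations for r in runs),
    }


runs = [
    RunRecord(True, 4, 1200, 0),
    RunRecord(False, 9, 3100, 1),
    RunRecord(True, 5, 1700, 0),
    RunRecord(True, 6, 2000, 0),
]
report = summarize(runs)
```

Tracking these numbers per eval run, rather than per release, makes it easier to spot which change moved a metric.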

Common Benchmarks

SWE-bench - Software engineering tasks (fix GitHub issues)
HumanEval - Code generation capability
WebArena - Web interaction capability
GAIA - General AI assistant capability
AgentBench - Multi-scenario agent capability

Testing Strategy

Unit Tests

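# file_read_tool and search_tool are the agent's own tools under test.
def test_file_read_tool():
    # Assumes a fixture file test.txt that contains "expected content"
    result = file_read_tool("test.txt")
    assert result.success
    assert "expected content" in result.content

def test_search_tool():
    # Every query should return at least one hit that mentions the search term
    results = search_tool("function login")
    assert len(results) > 0
    assert any("login" in r.content for r in results)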

Integration Tests

def test_complete_workflow():
    agent = Agent(tools=[...])
    result = agent.run("Create a hello world function")

    # Check the side effects of the run, not just the returned message
    assert result.files_created == ["hello.py"]
    assert "def hello" in read_file("hello.py")
    assert run_tests().passed

Eval-Driven Development

# eval_suite.yaml
- name: "Basic file operations"
  tasks:
    - input: "Read config.json"
      expected: tool_call("read_file", "config.json")
    - input: "Create a new file test.py"
      expected: file_exists("test.py")

- name: "Error handling"
  tasks:
    - input: "Read nonexistent.txt"
      expected: graceful_error_message()
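One way to make a suite like this executable is to treat each `expected` entry as a check function over the outcome of a run. The sketch below assumes a hypothetical `Outcome` record and a stub agent with canned responses; the suite is written inline as Python data rather than parsed from YAML, and none of these names come from a real framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Outcome:
    """What one agent run produced. A hypothetical shape for this sketch."""
    tool_calls: list = field(default_factory=list)  # (tool_name, argument) pairs
    files: set = field(default_factory=set)         # paths that exist after the run
    error_message: Optional[str] = None
    crashed: bool = False


Check = Callable[[Outcome], bool]


def tool_call(name: str, arg: str) -> Check:
    """Passes if the agent called tool `name` with argument `arg`."""
    return lambda outcome: (name, arg) in outcome.tool_calls


def file_exists(path: str) -> Check:
    """Passes if `path` exists after the run."""
    return lambda outcome: path in outcome.files


def graceful_error_message() -> Check:
    """Passes if the agent reported an error without crashing."""
    return lambda outcome: bool(outcome.error_message) and not outcome.crashed


SUITE = [
    ("Read config.json", tool_call("read_file", "config.json")),
    ("Create a new file test.py", file_exists("test.py")),
    ("Read nonexistent.txt", graceful_error_message()),
]


def run_suite(agent: Callable[[str], Outcome]) -> dict:
    """Run every task through `agent` and record pass/fail per input."""
    return {prompt: check(agent(prompt)) for prompt, check in SUITE}


def stub_agent(prompt: str) -> Outcome:
    """Canned responses standing in for a real agent."""
    canned = {
        "Read config.json": Outcome(tool_calls=[("read_file", "config.json")]),
        "Create a new file test.py": Outcome(files={"test.py"}),
        "Read nonexistent.txt": Outcome(error_message="File not found: nonexistent.txt"),
    }
    return canned[prompt]


results = run_suite(stub_agent)
```

Writing the checks first and the agent behavior second is what makes the workflow eval-driven: a failing entry in `results` points at the next thing to build.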

Observability

# Full execution trace
{
  "session_id": "abc123",
  "task": "Fix the login bug",
  "steps": [
    {"action": "read_file", "args": {"path": "auth.py"}, "duration_ms": 45},
    {"action": "think", "content": "Found the issue...", "tokens": 150},
    {"action": "edit_file", "args": {...}, "duration_ms": 120}
  ],
  "total_tokens": 2500,
  "total_duration_ms": 8500,
  "success": true
}
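A trace in this shape can be aggregated to see where time and tokens go. The sketch below assumes the field names from the trace above (`steps`, `action`, `duration_ms`, `tokens`, `total_duration_ms`); the `summarize_trace` helper itself is hypothetical, and the sample omits step arguments because the summary does not need them.

```python
from collections import defaultdict


def summarize_trace(trace: dict) -> dict:
    """Aggregate call counts, latency, and tokens per action for one trace."""
    per_action = defaultdict(lambda: {"calls": 0, "duration_ms": 0, "tokens": 0})
    for step in trace["steps"]:
        stats = per_action[step["action"]]
        stats["calls"] += 1
        # Not every step records both fields, so default to zero
        stats["duration_ms"] += step.get("duration_ms", 0)
        stats["tokens"] += step.get("tokens", 0)
    step_ms = sum(s["duration_ms"] for s in per_action.values())
    return {
        "per_action": dict(per_action),
        "step_time_ms": step_ms,
        # Time not attributed to any recorded step
        "unattributed_ms": trace["total_duration_ms"] - step_ms,
    }


trace = {
    "session_id": "abc123",
    "steps": [
        {"action": "read_file", "duration_ms": 45},
        {"action": "think", "tokens": 150},
        {"action": "edit_file", "duration_ms": 120},
    ],
    "total_tokens": 2500,
    "total_duration_ms": 8500,
    "success": True,
}
summary = summarize_trace(trace)
```

A large `unattributed_ms` value suggests that the trace is missing instrumentation for some steps, which is worth fixing before drawing conclusions from the per-action numbers.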

Model Routing & Fallback

model-routing.ts

import { query } from "@anthropic-ai/claude-agent-sdk";

// Auto-fallback: if primary fails, use backup
const response = query({
  prompt: "Generate a complex migration plan",
  options: {
    model: "claude-opus-4-5",
    fallbackModel: "claude-sonnet-4-5"  // Auto-fallback
  }
});

// Dynamic model switching mid-session
const q = query({ prompt: "Analyze this code" });
const models = await q.supportedModels(); // List available
await q.setModel("claude-opus-4-5");     // Escalate for hard tasks

// Cost-aware routing
const account = await q.accountInfo();
// `credits` is illustrative; check which fields accountInfo() returns in your SDK version
console.log(`Budget remaining: ${account.credits}`);

// Error code handling (message shape and code names as used in this lesson;
// check them against the SDK's message types)
for await (const msg of response) {
  if (msg.type === "error") {
    switch (msg.error_code) {
      case "RATE_LIMIT_EXCEEDED":      // Retry with backoff
        break;
      case "CONTEXT_LENGTH_EXCEEDED":  // Rotate session
        break;
      case "AUTHENTICATION_FAILED":    // Check API key
        break;
    }
  }
}
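The "retry with backoff" branch can be sketched independently of any SDK. The example below is a minimal Python illustration, assuming a hypothetical `AgentError` exception that carries one of the error codes named above; only rate-limit errors are retried, while the other codes are raised immediately because retrying would not help.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class AgentError(Exception):
    """Hypothetical error wrapper carrying one of the lesson's error codes."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


RETRYABLE = {"RATE_LIMIT_EXCEEDED"}


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying retryable errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except AgentError as err:
            is_last_attempt = attempt == max_attempts - 1
            if err.code not in RETRYABLE or is_last_attempt:
                raise
            # Delay doubles each attempt (0.5s, 1s, 2s, ...) plus up to 10% jitter
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")


# Demo: a call that is rate limited twice, then succeeds.
attempts = {"count": 0}
delays = []


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise AgentError("RATE_LIMIT_EXCEEDED")
    return "ok"


# Record the delays instead of actually sleeping
result = call_with_backoff(flaky_call, sleep=delays.append)
```

Injecting `sleep` keeps the helper testable: an eval can assert on the recorded delays without waiting for them.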

Safety Evaluation

Privilege escalation tests - Attempt access to restricted resources
Injection attack tests - Handle malicious inputs
Resource abuse tests - Infinite loops, excessive API calls
Data leakage tests - Handling of sensitive information
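Two of these categories can be sketched as ordinary tests. The example below assumes a hypothetical workspace sandbox check (`is_path_allowed`) for the privilege escalation case and a hypothetical `StepBudget` guard for the resource abuse case; neither is part of a real SDK, and real agents would need broader coverage than this.

```python
from pathlib import Path


def is_path_allowed(requested: str, workspace: str) -> bool:
    """Return True only if `requested` resolves to a location inside `workspace`."""
    root = Path(workspace).resolve()
    # Joining an absolute path replaces the root, so "/etc/passwd" is caught too
    target = (root / requested).resolve()
    return target == root or root in target.parents


class StepBudget:
    """Caps the number of agent steps to stop runaway loops."""

    def __init__(self, max_steps: int):
        self.max_steps = max_steps
        self.used = 0

    def consume(self) -> None:
        self.used += 1
        if self.used > self.max_steps:
            raise RuntimeError(f"step budget of {self.max_steps} exceeded")


def test_privilege_escalation_is_blocked():
    assert is_path_allowed("src/app.py", "/tmp/workspace")
    assert not is_path_allowed("../../etc/passwd", "/tmp/workspace")
    assert not is_path_allowed("/etc/passwd", "/tmp/workspace")


def test_runaway_loop_is_stopped():
    budget = StepBudget(max_steps=3)
    steps_taken = 0
    try:
        while True:  # simulates an agent that never decides to stop
            budget.consume()
            steps_taken += 1
    except RuntimeError:
        pass
    assert steps_taken == 3
```

Injection and data leakage tests follow the same pattern: feed the agent a hostile input, then assert on what it did rather than on what it said.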

Best Practices

Build a continuous evaluation pipeline (CI/CD integration)
Collect real user feedback
A/B test different prompts and configurations
Run regular regression tests
Monitor production performance
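In a CI pipeline, regression testing often comes down to comparing the current eval metrics against a stored baseline. The sketch below assumes metrics where higher is better and a hypothetical `regression_gate` helper; the metric names and the tolerance value are placeholders.

```python
def regression_gate(baseline: dict, current: dict, tolerance: float = 0.02) -> list:
    """Return one message per metric that fell more than `tolerance` below baseline."""
    failures = []
    for metric, old in baseline.items():
        new = current.get(metric)
        if new is None:
            failures.append(f"{metric}: missing from current run")
        elif new < old - tolerance:
            failures.append(f"{metric}: {old:.2f} -> {new:.2f}")
    return failures


baseline = {"success_rate": 0.90, "test_pass_rate": 0.85}
current = {"success_rate": 0.84, "test_pass_rate": 0.86}
failures = regression_gate(baseline, current)
```

A CI job would exit with a non-zero status when `failures` is non-empty, so a prompt or configuration change that lowers the success rate blocks the merge.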

Conclusion

Congrats on completing the agent development fundamentals! Next, you can:

Chat with an AI mentor - Dive into specific questions
Hands-on practice - Apply what you learned
Claude Code case study - Learn production-grade agents

Try It: Cost-Aware Model Router

Build a router that picks the optimal model based on task complexity.

Use supportedModels() to list available models
Classify tasks as simple/medium/complex
Route: Haiku for simple, Sonnet for medium, Opus for complex
Run 20 trials, compute success rate and average token cost per route
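As a starting point for the exercise, the steps above can be sketched in Python. Everything here is an assumption made for illustration: the keyword heuristic in `classify`, the tier names in `ROUTES`, and `stub_run`, which stands in for a real agent call and returns placeholder values rather than measurements. Checking the chosen tier against the list from supportedModels() is left out.

```python
from collections import defaultdict
from typing import Callable

# Tier per complexity class, as described in the exercise
ROUTES = {"simple": "haiku", "medium": "sonnet", "complex": "opus"}

# Hypothetical keyword hints; a real router might use a small model or past eval data
COMPLEX_HINTS = ("migration", "refactor", "architecture", "design")
MEDIUM_HINTS = ("fix", "implement", "debug", "test")


def classify(task: str) -> str:
    """Classify a task as simple, medium, or complex with a keyword heuristic."""
    text = task.lower()
    if any(hint in text for hint in COMPLEX_HINTS):
        return "complex"
    if any(hint in text for hint in MEDIUM_HINTS):
        return "medium"
    return "simple"


def route(task: str) -> str:
    """Pick the model tier for a task."""
    return ROUTES[classify(task)]


def stub_run(model: str, task: str) -> tuple:
    """Placeholder for a real agent call. Returns (success, tokens_used)."""
    return True, 100 * len(task.split())


def evaluate(tasks: list, run: Callable = stub_run, trials: int = 20) -> dict:
    """Run `trials` tasks (cycling through `tasks`) and report stats per route."""
    stats = defaultdict(lambda: {"runs": 0, "successes": 0, "tokens": 0})
    for i in range(trials):
        task = tasks[i % len(tasks)]
        model = route(task)
        success, tokens = run(model, task)
        entry = stats[model]
        entry["runs"] += 1
        entry["successes"] += int(success)
        entry["tokens"] += tokens
    return {
        model: {
            "runs": s["runs"],
            "success_rate": s["successes"] / s["runs"],
            "avg_tokens": s["tokens"] / s["runs"],
        }
        for model, s in stats.items()
    }


tasks = [
    "Rename a variable",
    "Fix the login bug",
    "Generate a complex migration plan",
]
report = evaluate(tasks)
```

Replacing `stub_run` with a function that calls the agent and reads token usage from its result turns `report` into the 20-trial eval report the gate asks for.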

Gate: P5 Complete — 20-trial eval report generated, model router picks optimal model, cost tracked per route.