Evaluation
Evaluating agent performance, reliability, and safety is key to building production-grade agents.
SDK Focus
fallbackModel
setModel()
supportedModels()
error codes
accountInfo()
Core Metrics
| Metric | Description | Measurement |
|---|---|---|
| Success rate | Task completion ratio | Successful tasks / total tasks |
| Efficiency | Steps required to complete tasks | Tool calls, token usage |
| Accuracy | Output quality | Test pass rate, human ratings |
| Safety | Rate of unsafe behavior | Policy violations count |
Common Benchmarks
- SWE-bench - Software engineering tasks (fix GitHub issues)
- HumanEval - Code generation capability
- WebArena - Web interaction capability
- GAIA - General AI assistant capability
- AgentBench - Multi-scenario agent capability
Testing Strategy
Unit Tests
def test_file_read_tool():
result = file_read_tool("test.txt")
assert result.success
assert "expected content" in result.content
def test_search_tool():
results = search_tool("function login")
assert len(results) > 0
assert any("login" in r.content for r in results)
Integration Tests
def test_complete_workflow():
agent = Agent(tools=[...])
result = agent.run("Create a hello world function")
assert result.files_created == ["hello.py"]
assert "def hello" in read_file("hello.py")
assert run_tests().passed
Eval-Driven Development
# eval_suite.yaml
- name: "Basic file operations"
tasks:
- input: "Read config.json"
expected: tool_call("read_file", "config.json")
- input: "Create a new file test.py"
expected: file_exists("test.py")
- name: "Error handling"
tasks:
- input: "Read nonexistent.txt"
expected: graceful_error_message()
Observability
# Full execution trace
{
"session_id": "abc123",
"task": "Fix the login bug",
"steps": [
{"action": "read_file", "args": {"path": "auth.py"}, "duration_ms": 45},
{"action": "think", "content": "Found the issue...", "tokens": 150},
{"action": "edit_file", "args": {...}, "duration_ms": 120}
],
"total_tokens": 2500,
"total_duration_ms": 8500,
"success": true
}
Model Routing & Fallback
model-routing.ts
import { query } from "@anthropic-ai/claude-agent-sdk";
// Auto-fallback: if primary fails, use backup
const response = query({
prompt: "Generate a complex migration plan",
options: {
model: "claude-opus-4-5",
fallbackModel: "claude-sonnet-4-5" // Auto-fallback
}
});
// Dynamic model switching mid-session
const q = query({ prompt: "Analyze this code" });
const models = await q.supportedModels(); // List available
await q.setModel("claude-opus-4-5"); // Escalate for hard tasks
// Cost-aware routing
const account = await q.accountInfo();
console.log(`Budget remaining: ${account.credits}`);
// Error code handling
for await (const msg of response) {
if (msg.type === "error") {
switch (msg.error_code) {
case "RATE_LIMIT_EXCEEDED": // Retry with backoff
case "CONTEXT_LENGTH_EXCEEDED": // Rotate session
case "AUTHENTICATION_FAILED": // Check API key
}
}
}
Safety Evaluation
- Privilege escalation tests - Attempt access to restricted resources
- Injection attack tests - Handle malicious inputs
- Resource abuse tests - Infinite loops, excessive API calls
- Data leakage tests - Handling of sensitive information
Best Practices
- Build a continuous evaluation pipeline (CI/CD integration)
- Collect real user feedback
- A/B test different prompts and configurations
- Run regular regression tests
- Monitor production performance
Conclusion
Congrats on completing the Agent development fundamentals! Next you can:
- Chat with an AI mentor - Dive into specific questions
- Hands-on practice - Apply what you learned
- Claude Code case study - Learn production-grade agents
Try It: Cost-Aware Model Router
Build a router that picks the optimal model based on task complexity.
- Use
supportedModels()to list available models - Classify tasks as simple/medium/complex
- Route: Haiku for simple, Sonnet for medium, Opus for complex
- Run 20 trials, compute success rate and average token cost per route
Gate: P5 Complete — 20-trial eval report generated, model router picks optimal model, cost tracked per route.