Production Agents
The final frontier: taking AI agents from prototype to production. Master AgentOps, observability, guardrails, cost management, and deployment strategies.
The PoC-to-Production Gap
Most agent demos work in controlled settings but fail in production. The gap includes:
- Non-determinism — Same input can produce different outputs
- Latency — Multi-step agent loops add seconds to minutes of delay
- Cost — Uncontrolled LLM calls can increase cloud costs by 200%
- Reliability — Autonomous agents complete only ~50% of common workflows
- Debugging — 67% of failures are discovered by users, not monitoring
AgentOps Lifecycle
AgentOps is to AI agents what DevOps is to traditional software:
┌──────────────────────────────────────────────────────────┐
│                    AgentOps Lifecycle                    │
├──────────────────────────────────────────────────────────┤
│                                                          │
│   Plan → Build → Evaluate → Deploy → Monitor → Improve   │
│    ↑                                                │    │
│    └──────────────── Feedback Loop ─────────────────┘    │
│                                                          │
└──────────────────────────────────────────────────────────┘
Plan: Define outcomes, policies, success metrics
Build: Small tools, structured prompts, testable agents
Evaluate: Happy paths, edge cases, failure modes
Deploy: Shadow mode → Canary → Progressive rollout
Monitor: Traces, metrics, alerts, dashboards
Improve: Error taxonomy → Fix prompts/tools/retrieval
Observability Stack
Teams with comprehensive observability debug issues 10x faster (12 min vs 2 hours):
Key Metrics
metrics:
  success_rate:    { target: ">99%",     alert: "<95% for 5min" }
  p95_latency:     { target: "<3s",      alert: ">10s sustained" }
  token_usage:     { budget: "100K/day", alert: ">150% of budget" }
  error_rate:      { target: "<1%",      alert: ">5% for 5min" }
  escalation_rate: { target: "10-15%",   alert: ">30%" }
  cost_per_task:   { target: "<$0.50",   alert: ">$2.00" }
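A minimal sketch of enforcing these thresholds in application code; the `MetricSnapshot` shape and the `page` alerting callback are hypothetical stand-ins for your metrics store and pager:

// Hypothetical snapshot pulled from your metrics store over the alert window.
interface MetricSnapshot {
  successRate: number;   // 0..1
  p95LatencyMs: number;
  errorRate: number;     // 0..1
  costPerTask: number;   // USD
}

// Mirrors the alert column of the YAML above; `page` stands in for your alerting hook.
function checkAlerts(m: MetricSnapshot, page: (msg: string) => void): void {
  if (m.successRate < 0.95) page(`success_rate ${m.successRate} below 95% for 5min`);
  if (m.p95LatencyMs > 10_000) page(`p95_latency ${m.p95LatencyMs}ms above 10s sustained`);
  if (m.errorRate > 0.05) page(`error_rate ${m.errorRate} above 5% for 5min`);
  if (m.costPerTask > 2.0) page(`cost_per_task $${m.costPerTask} above $2.00`);
}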
Distributed Tracing
// Every agent action gets a trace
{
  "trace_id": "abc-123",
  "session_id": "user-session-456",
  "spans": [
    {
      "name": "read_file",
      "duration_ms": 45,
      "tokens_in": 200,
      "tokens_out": 1500,
      "status": "ok"
    },
    {
      "name": "llm_decision",
      "duration_ms": 2100,
      "tokens_in": 3000,
      "tokens_out": 500,
      "model": "claude-sonnet-4",
      "status": "ok"
    },
    {
      "name": "edit_file",
      "duration_ms": 120,
      "status": "ok"
    }
  ],
  "total_cost": "$0.12",
  "success": true
}
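With OpenTelemetry, spans like these can be emitted by wrapping every tool call. A sketch using the public @opentelemetry/api surface; it assumes a tracer provider and exporter are configured elsewhere:

import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("agent");

// Wrap any tool call so duration, attributes, and status land on a span.
async function tracedToolCall<T>(
  name: string,
  fn: () => Promise<T>,
  attrs: Record<string, string | number> = {},
): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    span.setAttributes(attrs); // e.g., { tokens_in: 200, tokens_out: 1500 }
    try {
      const result = await fn();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: String(err) });
      throw err;
    } finally {
      span.end();
    }
  });
}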
Recommended Stack
- Metrics: Prometheus + Grafana
- Tracing: Jaeger / OpenTelemetry
- Agent-specific: LangSmith, AgentOps, Langfuse
- Logging: Structured JSON with consistent trace IDs
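For the logging layer, the key discipline is one JSON object per line, always carrying the trace ID. A minimal sketch:

// One JSON object per line; the trace_id matches the span the event belongs to.
function logEvent(traceId: string, event: string, fields: Record<string, unknown> = {}): void {
  console.log(JSON.stringify({
    ts: new Date().toISOString(),
    trace_id: traceId,
    event,
    ...fields,
  }));
}

// Usage: logEvent("abc-123", "tool_call", { name: "read_file", duration_ms: 45 });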
Safety Guardrails
Translate abstract policies into machine-verifiable controls:
guardrails:
  input_validation:
    - max_tokens: 10000
    - blocked_patterns: ["rm -rf /", "DROP TABLE", "sudo"]
    - content_filter: enabled

  output_validation:
    - no_pii_in_responses: true
    - max_actions_per_turn: 10
    - forbidden_tools_in_production: ["delete_database"]

  runtime_limits:
    - max_loop_iterations: 20
    - max_total_tokens: 500000
    - timeout_seconds: 300
    - max_cost_per_session: "$5.00"

  failure_modes:  # Microsoft's failure-mode taxonomy
    - security_lapses: → block + alert
    - hallucinated_tools: → validate tool exists
    - planning_loops: → iteration limit + escalate
    - memory_poisoning: → input sanitization
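A sketch of enforcing a few of these guardrails in application code before and during the agent loop; the limits mirror the policy above, and the ~4 characters-per-token estimate is a rough assumption:

// Limits mirror the guardrail policy above; names are illustrative.
const BLOCKED_PATTERNS = ["rm -rf /", "DROP TABLE", "sudo"];

interface SessionBudget {
  iterations: number;
  totalTokens: number;
  costUsd: number;
}

function validateInput(prompt: string): void {
  // Rough proxy for the 10K-token cap, assuming ~4 characters per token.
  if (prompt.length > 40_000) throw new Error("input exceeds max_tokens budget");
  for (const pattern of BLOCKED_PATTERNS) {
    if (prompt.includes(pattern)) throw new Error(`blocked pattern: ${pattern}`);
  }
}

// Call once per agent-loop iteration so runaway loops escalate instead of spinning.
function enforceRuntimeLimits(b: SessionBudget): void {
  if (b.iterations > 20) throw new Error("max_loop_iterations exceeded; escalate to human");
  if (b.totalTokens > 500_000) throw new Error("max_total_tokens exceeded");
  if (b.costUsd > 5.0) throw new Error("max_cost_per_session exceeded");
}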
Several of these controls map directly onto Claude Agent SDK query() options and hooks:

import { query } from "@anthropic-ai/claude-agent-sdk";

// Hook shapes below are simplified for illustration; see the SDK docs for exact
// signatures. `metrics` and `alerting` are app-level helpers, not SDK exports.
const response = query({
  prompt: "Run the deployment pipeline",
  options: {
    sandbox: {
      enabled: true,
      autoAllowBashIfSandboxed: true,
      excludedCommands: ["rm", "dd", "mkfs"],
      allowUnsandboxedCommands: false,
      network: {
        enabled: true,
        proxyUrl: "http://proxy.internal:8080"
      }
    },
    plugins: ["./plugins/deploy-validator"],
    settingSources: ["project"],
    permissionMode: "default",
    hooks: {
      SessionStart: async () => { metrics.startTimer(); },
      PreToolUse: async (input) => {
        metrics.trackToolCall(input.toolName);
        return { allow: true };
      },
      PostToolUse: async (input) => {
        if (input.result?.isError) metrics.trackError(input.toolName);
      },
      Error: async (input) => {
        alerting.send(`Agent error: ${input.error_code}`);
      },
      Stop: async () => {
        const report = metrics.getReport();
        console.log(`Duration: ${report.duration}ms`);
        console.log(`Tools: ${report.toolCalls}, Errors: ${report.errors}`);
      }
    }
  }
});
14 Known Issues to Watch
The SDK has 14 documented issues, including the CLI not being found, a missing MCP type field (surfacing as a cryptic exit code 1), Unicode line separators breaking JSON parsing, context overflow leaving sessions unrecoverable, and subagent cleanup. Each is a production lesson; review them all before deploying.
Deployment Strategies
Progressive Rollout
Stage 1: Shadow Mode (1 week)
- Agent runs in parallel; outputs are logged but not used
- Compare against the human baseline

Stage 2: Canary (1 week)
- 5% of traffic goes to the agent (see the routing sketch after this list)
- Monitor all metrics closely

Stage 3: Progressive (2 weeks)
- 5% → 25% → 50% → 100%
- Automated rollback on metric degradation

Stage 4: Full Production
- 100% of traffic with full observability
- Maintain rollback capability
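Stage gating needs sticky cohorts: a user should not flip between agent and human on every request. A minimal TypeScript sketch of deterministic percentage routing; `rolloutPercent` is an assumed config value:

import { createHash } from "node:crypto";

// Hash the user ID into a stable bucket 0..99 so cohort membership is sticky.
function bucketOf(userId: string): number {
  return createHash("sha256").update(userId).digest().readUInt32BE(0) % 100;
}

// rolloutPercent comes from config: 0 (shadow) → 5 (canary) → 25 → 50 → 100.
function routeToAgent(userId: string, rolloutPercent: number): boolean {
  return bucketOf(userId) < rolloutPercent;
}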
Rollback & Circuit Breakers
circuit_breaker:
  error_threshold: 10%
  window: 5_minutes
  action: fallback_to_human

# Regularly test rollback
drill_schedule: "monthly"
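A minimal in-process version of that breaker, assuming outcomes are recorded per task; the thresholds mirror the config above:

// Sliding-window breaker: 10% errors over 5 minutes trips fallback_to_human.
class CircuitBreaker {
  private outcomes: { at: number; ok: boolean }[] = [];

  constructor(
    private errorThreshold = 0.10,
    private windowMs = 5 * 60_000,
  ) {}

  record(ok: boolean): void {
    const now = Date.now();
    this.outcomes.push({ at: now, ok });
    this.outcomes = this.outcomes.filter((o) => now - o.at < this.windowMs);
  }

  // True means: stop routing to the agent and fall back to a human.
  isOpen(): boolean {
    if (this.outcomes.length < 20) return false; // avoid tripping on tiny samples
    const errors = this.outcomes.filter((o) => !o.ok).length;
    return errors / this.outcomes.length >= this.errorThreshold;
  }
}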
Cost Management
Token costs are the #1 surprise in production agents:
- Budget per task — Set hard limits (e.g., $5 max per session)
- Model routing — Use fast/cheap models for exploration, capable models for decisions (both levers are sketched after this list)
- Context pruning — Aggressively manage what's in the context window
- Caching — Cache tool results, prompt prefixes, and common retrieval results
- Token monitoring — Teams processing 100M+ daily requests save $50K+/month with proper monitoring
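A sketch of the first two levers, per-session budgets and model routing. The model names and prices here are placeholders, not real pricing:

// Placeholder per-1M-token prices; substitute your provider's real pricing.
const PRICES: Record<string, { input: number; output: number }> = {
  "cheap-model":   { input: 0.25, output: 1.25 },
  "capable-model": { input: 3.00, output: 15.00 },
};

class CostTracker {
  private spentUsd = 0;
  constructor(private budgetUsd: number) {}

  // Record a call's cost and enforce the hard per-session cap.
  charge(model: string, tokensIn: number, tokensOut: number): void {
    const p = PRICES[model];
    if (!p) throw new Error(`unknown model: ${model}`);
    this.spentUsd += (tokensIn / 1e6) * p.input + (tokensOut / 1e6) * p.output;
    if (this.spentUsd > this.budgetUsd) {
      throw new Error(`session budget $${this.budgetUsd} exceeded`);
    }
  }

  // Cheap model for exploration; capable model only for consequential decisions.
  pickModel(step: "explore" | "decide"): string {
    return step === "explore" ? "cheap-model" : "capable-model";
  }
}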
Continuous Evaluation
# CI/CD pipeline for agents
agent_pipeline:
  on_prompt_change:
    - run: eval_suite.yaml
    - assert: success_rate >= 0.95
    - assert: avg_cost <= $0.50
    - assert: p95_latency <= 5s

  on_tool_change:
    - run: integration_tests
    - run: safety_tests

  weekly:
    - run: full_regression_suite
    - run: adversarial_tests
    - generate: performance_report
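The eval gate itself can be small. A sketch of an N-trial harness, where `runAgent` and `grade` are assumed to be supplied by your suite; multiple trials per case account for non-determinism:

// N trials per case because agent output varies run to run.
async function evalSuite(
  cases: { prompt: string; expected: string }[],
  runAgent: (prompt: string) => Promise<string>,
  grade: (output: string, expected: string) => boolean,
  trials = 5,
): Promise<number> {
  let passed = 0;
  let total = 0;
  for (const c of cases) {
    for (let i = 0; i < trials; i++) {
      total++;
      if (grade(await runAgent(c.prompt), c.expected)) passed++;
    }
  }
  const successRate = passed / total;
  // The same gate as the pipeline's assert: block the change below 95%.
  if (successRate < 0.95) throw new Error(`success_rate ${successRate.toFixed(2)} below 0.95 gate`);
  return successRate;
}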
Production Readiness Checklist
- Observability: traces, metrics, alerts configured
- Guardrails: input/output validation, runtime limits
- HITL: approval workflows for high-risk actions
- Cost controls: per-session budgets, model routing
- Deployment: progressive rollout with rollback
- Evaluation: CI/CD eval pipeline, regression suite
- Documentation: runbook, incident response plan
- Security: audit trail, permission tiers, sandboxing
Congratulations!
You've completed the full AgentWay curriculum. You now understand:
- Agent fundamentals (P1-P3): basics, tools, prompts
- Core systems (P4-P6): memory, evaluation, design patterns
- Advanced topics (P7-P12): MCP, RAG, multi-agent, code gen, HITL, production
What's next?
- Chat with an AI Mentor — Dive deeper into specific topics
- Hands-on Exercises — Apply what you've learned
- Claude Code Deep Dive — Study a production-grade agent
Try It: Final Capstone — End-to-End Agent
Build a production-ready agent using the full SDK surface area.
- MCP tools + subagents for task decomposition
- Eval harness: N-trial report with success rate
- sandbox{} for safe execution
- All 12 hooks for observability and control
- Cost report with token usage breakdown
- Session management with rotation before context limit
- Public repo with README, tests, and demo