Production Agents

The final frontier: taking AI agents from prototype to production. Master AgentOps, observability, guardrails, cost management, and deployment strategies.

SDK Focus: sandbox{} · all 12 hooks · plugins[] · known issues (14)

The PoC-to-Production Gap

Most agent demos work in controlled settings but fail in production. The gap includes:

  • Non-determinism — Same input can produce different outputs
  • Latency — Multi-step agent loops add seconds to minutes of delay
  • Cost — Uncontrolled LLM calls can increase cloud costs by 200%
  • Reliability — Autonomous agents complete only ~50% of common workflows
  • Debugging — 67% of failures are discovered by users, not monitoring

AgentOps Lifecycle

AgentOps is to AI agents what DevOps is to traditional software:

┌────────────────────────────────────────────────────────┐
│                   AgentOps Lifecycle                   │
├────────────────────────────────────────────────────────┤
│                                                        │
│  Plan → Build → Evaluate → Deploy → Monitor → Improve  │
│   ↑                                            │       │
│   └──────────────── Feedback Loop ─────────────┘       │
│                                                        │
└────────────────────────────────────────────────────────┘

Plan:     Define outcomes, policies, success metrics
Build:    Small tools, structured prompts, testable agents
Evaluate: Happy paths, edge cases, failure modes
Deploy:   Shadow mode → Canary → Progressive rollout
Monitor:  Traces, metrics, alerts, dashboards
Improve:  Error taxonomy → Fix prompts/tools/retrieval

Observability Stack

Teams with comprehensive observability debug issues 10x faster (12 min vs 2 hours):

Key Metrics

metrics:
  success_rate:    { target: ">99%",     alert: "<95% for 5min" }
  p95_latency:     { target: "<3s",      alert: ">10s sustained" }
  token_usage:     { budget: "100K/day", alert: ">150% of budget" }
  error_rate:      { target: "<1%",      alert: ">5% for 5min" }
  escalation_rate: { target: "10-15%",   alert: ">30%" }
  cost_per_task:   { target: "<$0.50",   alert: ">$2.00" }
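
A minimal sketch of wiring these thresholds into code, assuming a hypothetical MetricSnapshot sampled from your metrics store over the alert window (none of these names come from an SDK):

// Hypothetical snapshot sampled from your metrics store.
interface MetricSnapshot {
  successRate: number;  // 0..1
  p95LatencyMs: number;
  dailyTokens: number;
  costPerTask: number;  // USD
}

const DAILY_TOKEN_BUDGET = 100_000; // 100K/day, from the table above

// Evaluates the alert column of the table; delivery (Slack, PagerDuty) is omitted.
function checkAlerts(m: MetricSnapshot): string[] {
  const alerts: string[] = [];
  if (m.successRate < 0.95) alerts.push("success_rate below 95%");
  if (m.p95LatencyMs > 10_000) alerts.push("p95_latency above 10s");
  if (m.dailyTokens > 1.5 * DAILY_TOKEN_BUDGET) alerts.push("token_usage over 150% of budget");
  if (m.costPerTask > 2.0) alerts.push("cost_per_task above $2.00");
  return alerts;
}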

Distributed Tracing

// Every agent action gets a trace
{
  "trace_id": "abc-123",
  "session_id": "user-session-456",
  "spans": [
    {
      "name": "read_file",
      "duration_ms": 45,
      "tokens_in": 200,
      "tokens_out": 1500,
      "status": "ok"
    },
    {
      "name": "llm_decision",
      "duration_ms": 2100,
      "tokens_in": 3000,
      "tokens_out": 500,
      "model": "claude-sonnet-4",
      "status": "ok"
    },
    {
      "name": "edit_file",
      "duration_ms": 120,
      "status": "ok"
    }
  ],
  "total_cost": "$0.12",
  "success": true
}

Recommended Stack

  • Metrics: Prometheus + Grafana
  • Tracing: Jaeger / OpenTelemetry (see the sketch after this list)
  • Agent-specific: LangSmith, AgentOps, Langfuse
  • Logging: Structured JSON with consistent trace IDs
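
As a sketch of the OpenTelemetry option, the wrapper below puts each tool call in a span like the trace example above; only the @opentelemetry/api import is real, while the tracedToolCall helper is our own convention:

import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("agent");

// Hypothetical helper: wrap any tool call in a span.
async function tracedToolCall<T>(name: string, fn: () => Promise<T>): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    try {
      const result = await fn();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: String(err) });
      throw err; // surface the failure; the span still records it
    } finally {
      span.end();
    }
  });
}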

Safety Guardrails

Translate abstract policies into machine-verifiable controls:

guardrails:
  input_validation:
    - max_tokens: 10000
    - blocked_patterns: ["rm -rf /", "DROP TABLE", "sudo"]
    - content_filter: enabled

  output_validation:
    - no_pii_in_responses: true
    - max_actions_per_turn: 10
    - forbidden_tools_in_production: ["delete_database"]

  runtime_limits:
    - max_loop_iterations: 20
    - max_total_tokens: 500000
    - timeout_seconds: 300
    - max_cost_per_session: "$5.00"

  failure_modes:   # Microsoft's taxonomy
    - security_lapses:    → block + alert
    - hallucinated_tools: → validate tool exists
    - planning_loops:     → iteration limit + escalate
    - memory_poisoning:   → input sanitization
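
A YAML file only helps if the loop enforces it. Below is a minimal sketch of runtime-limit enforcement, assuming a hypothetical step() function that runs one agent iteration and reports its token and dollar spend:

// Limits mirroring the runtime_limits block above.
const LIMITS = {
  maxLoopIterations: 20,
  maxTotalTokens: 500_000,
  timeoutMs: 300_000,
  maxCostPerSession: 5.0, // USD
};

interface AgentStep { tokens: number; cost: number; done: boolean }

// `step` is a stand-in for one iteration of your agent loop.
async function runWithLimits(step: () => Promise<AgentStep>): Promise<void> {
  const deadline = Date.now() + LIMITS.timeoutMs;
  let tokens = 0, cost = 0;
  for (let i = 0; i < LIMITS.maxLoopIterations; i++) {
    if (Date.now() > deadline) throw new Error("timeout: escalate to human");
    const s = await step();
    tokens += s.tokens;
    cost += s.cost;
    if (tokens > LIMITS.maxTotalTokens) throw new Error("token limit exceeded");
    if (cost > LIMITS.maxCostPerSession) throw new Error("cost cap exceeded");
    if (s.done) return;
  }
  throw new Error("planning loop: iteration limit hit, escalating");
}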

Production Sandbox (production.ts)

import { query } from "@anthropic-ai/claude-agent-sdk";

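// NOTE: `metrics` and `alerting` below are assumed application-level helpers
// (thin wrappers over your observability clients), not exports of the SDK.
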
const response = query({
  prompt: "Run the deployment pipeline",
  options: {
    sandbox: {
      enabled: true,
      autoAllowBashIfSandboxed: true,
      excludedCommands: ["rm", "dd", "mkfs"],
      allowUnsandboxedCommands: false,
      network: {
        enabled: true,
        proxyUrl: "http://proxy.internal:8080"
      }
    },
    plugins: ["./plugins/deploy-validator"],
    settingSources: ["project"],
    permissionMode: "default",

    hooks: {
      SessionStart: async () => { metrics.startTimer(); },
      PreToolUse: async (input) => {
        metrics.trackToolCall(input.toolName);
        return { allow: true };
      },
      PostToolUse: async (input) => {
        if (input.result?.isError) metrics.trackError(input.toolName);
      },
      Error: async (input) => {
        alerting.send(`Agent error: ${input.error_code}`);
      },
      Stop: async () => {
        const report = metrics.getReport();
        console.log(`Duration: ${report.duration}ms`);
        console.log(`Tools: ${report.toolCalls}, Errors: ${report.errors}`);
      }
    }
  }
});

14 Known Issues to Watch

The SDK has 14 documented issues including: CLI not found, MCP type field missing (cryptic exit code 1), Unicode line separators breaking JSON, context overflow making sessions unrecoverable, and subagent cleanup. Each is a production lesson — review them all before deploying.

Deployment Strategies

Progressive Rollout

Stage 1: Shadow Mode (1 week)
  Agent runs in parallel, outputs logged but not used
  Compare with human baseline

Stage 2: Canary (1 week)
  5% of traffic → agent
  Monitor all metrics closely

Stage 3: Progressive (2 weeks)
  5% → 25% → 50% → 100%
  Automated rollback on metric degradation

Stage 4: Full Production
  100% with full observability
  Maintain rollback capability
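
One way to implement the traffic split is deterministic bucketing, sketched below: hashing the user ID keeps each user on the same side of the split as the percentage grows (in practice the rollout value lives in a config service, not a constant):

import { createHash } from "node:crypto";

// Current rollout percentage for the canary/progressive stages.
let rolloutPercent = 5;

// Deterministic bucket per user, so a user doesn't flip between agent and baseline.
function routeToAgent(userId: string): boolean {
  const digest = createHash("sha256").update(userId).digest();
  const bucket = digest.readUInt16BE(0) % 100; // 0..99
  return bucket < rolloutPercent;
}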

Rollback & Circuit Breakers

circuit_breaker:
  error_threshold: 10%
  window: 5_minutes
  action: fallback_to_human
  
  # Regularly test rollback
  drill_schedule: "monthly"
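
A minimal in-process sketch of that breaker, assuming each agent session reports success or failure; the 20-sample floor keeps a tiny window from tripping it (reset and drill logic are omitted):

const WINDOW_MS = 5 * 60_000;  // 5 minute window, per the config above
const ERROR_THRESHOLD = 0.10;  // 10% error rate trips the breaker

const outcomes: { at: number; error: boolean }[] = [];
let open = false;

function record(error: boolean): void {
  const now = Date.now();
  outcomes.push({ at: now, error });
  // Drop samples that have aged out of the window.
  while (outcomes.length && outcomes[0].at < now - WINDOW_MS) outcomes.shift();
  const errors = outcomes.filter((o) => o.error).length;
  if (outcomes.length >= 20 && errors / outcomes.length > ERROR_THRESHOLD) {
    open = true; // breaker trips: route new sessions to humans
  }
}

function shouldFallbackToHuman(): boolean {
  return open;
}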

Cost Management

Token costs are the #1 surprise in production agents:

  • Budget per task — Set hard limits (e.g., $5 max per session)
  • Model routing — Use fast/cheap models for exploration, capable models for decisions (see the sketch after this list)
  • Context pruning — Aggressively manage what's in the context window
  • Caching — Cache tool results, prompt prefixes, and common retrieval results
  • Token monitoring — Teams processing 100M+ daily requests save $50K+/month with proper monitoring
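
A sketch of that routing decision with a per-session budget check folded in; the model names are illustrative (claude-sonnet-4 echoes the trace example earlier) and pickModel is not an SDK API:

// Cheap model for exploration steps, capable model only when committing to an action.
type StepKind = "explore" | "decide";

const MODELS = {
  explore: "claude-3-5-haiku-latest", // fast/cheap; name is illustrative
  decide: "claude-sonnet-4",          // matches the trace example above
};

function pickModel(kind: StepKind, spentSoFar: number, budget = 5.0): string {
  if (spentSoFar >= budget) throw new Error("session budget exhausted");
  return MODELS[kind];
}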

Continuous Evaluation

# CI/CD pipeline for agents
agent_pipeline:
  on_prompt_change:
    - run: eval_suite.yaml
    - assert: success_rate >= 0.95
    - assert: avg_cost <= $0.50
    - assert: p95_latency <= 5s
    
  on_tool_change:
    - run: integration_tests
    - run: safety_tests
    
  weekly:
    - run: full_regression_suite
    - run: adversarial_tests
    - generate: performance_report
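
As a sketch of the success_rate gate, here is an N-trial runner that fails the pipeline below threshold; EvalCase is a stand-in for your eval harness's case type:

interface EvalCase { name: string; run: () => Promise<boolean> }

// Runs each case N times and enforces the success_rate >= 0.95 gate above.
async function evalGate(cases: EvalCase[], trials = 5, threshold = 0.95) {
  let passed = 0, total = 0;
  for (const c of cases) {
    for (let i = 0; i < trials; i++) {
      total++;
      if (await c.run().catch(() => false)) passed++;
    }
  }
  const rate = passed / total;
  console.log(`success_rate=${rate.toFixed(3)} over ${total} trials`);
  if (rate < threshold) process.exit(1); // non-zero exit fails CI
}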

Production Readiness Checklist

  • Observability: traces, metrics, alerts configured
  • Guardrails: input/output validation, runtime limits
  • HITL: approval workflows for high-risk actions
  • Cost controls: per-session budgets, model routing
  • Deployment: progressive rollout with rollback
  • Evaluation: CI/CD eval pipeline, regression suite
  • Documentation: runbook, incident response plan
  • Security: audit trail, permission tiers, sandboxing

Congratulations!

You've completed the full AgentWay curriculum. You now understand:

  • Agent fundamentals (P1-P3): basics, tools, prompts
  • Core systems (P4-P6): memory, evaluation, design patterns
  • Advanced topics (P7-P12): MCP, RAG, multi-agent, code gen, HITL, production

What's next?

Try It: Final Capstone — End-to-End Agent

Build a production-ready agent using the full SDK surface area.

  1. MCP tools + subagents for task decomposition
  2. Eval harness: N-trial report with success rate
  3. sandbox{} for safe execution
  4. All 12 hooks for observability and control
  5. Cost report with token usage breakdown
  6. Session management with rotation before context limit
  7. Public repo with README, tests, and demo

Gate: P12 Complete — L5 Expert Unlocked — Reproducible demo, eval report, cost report, sandbox verified.