Production Agents

The final frontier: taking AI agents from prototype to production. Master AgentOps, observability, guardrails, cost management, and deployment strategies.

SDK Focus: sandbox{} · all 12 hooks · plugins[] · known issues (14)

The PoC-to-Production Gap

Most agent demos work in controlled settings but fail in production. The gap includes:

  • Non-determinism — Same input can produce different outputs
  • Latency — Multi-step agent loops add seconds to minutes of delay
  • Cost — Uncontrolled LLM calls can increase cloud costs by 200%
  • Reliability — Autonomous agents complete only ~50% of common workflows
  • Debugging — 67% of failures are discovered by users, not monitoring

AgentOps Lifecycle

AgentOps is to AI agents what DevOps is to traditional software:

┌────────────────────────────────────────────────────────┐
│                   AgentOps Lifecycle                   │
├────────────────────────────────────────────────────────┤
│                                                        │
│  Plan → Build → Evaluate → Deploy → Monitor → Improve  │
│   ↑                                            │       │
│   └──────────────── Feedback Loop ─────────────┘       │
│                                                        │
└────────────────────────────────────────────────────────┘

Plan:     Define outcomes, policies, success metrics
Build:    Small tools, structured prompts, testable agents
Evaluate: Happy paths, edge cases, failure modes
Deploy:   Shadow mode → Canary → Progressive rollout
Monitor:  Traces, metrics, alerts, dashboards
Improve:  Error taxonomy → Fix prompts/tools/retrieval

Observability Stack

Teams with comprehensive observability debug issues 10x faster (12 min vs 2 hours):

Key Metrics

metrics:
  success_rate:    { target: ">99%",     alert: "<95% for 5min" }
  p95_latency:     { target: "<3s",      alert: ">10s sustained" }
  token_usage:     { budget: "100K/day", alert: ">150% of budget" }
  error_rate:      { target: "<1%",      alert: ">5% for 5min" }
  escalation_rate: { target: "10-15%",   alert: ">30%" }
  cost_per_task:   { target: "<$0.50",   alert: ">$2.00" }
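
A minimal sketch of wiring these thresholds into code, assuming a hypothetical MetricSnapshot sampled from your metrics store over the alert window (none of these names come from an SDK):

// Hypothetical snapshot sampled from your metrics store.
interface MetricSnapshot {
  successRate: number;  // 0..1
  p95LatencyMs: number;
  dailyTokens: number;
  costPerTask: number;  // USD
}

const DAILY_TOKEN_BUDGET = 100_000; // 100K/day, from the table above

// Evaluates the alert column of the table; delivery (Slack, PagerDuty) is omitted.
function checkAlerts(m: MetricSnapshot): string[] {
  const alerts: string[] = [];
  if (m.successRate < 0.95) alerts.push("success_rate below 95%");
  if (m.p95LatencyMs > 10_000) alerts.push("p95_latency above 10s");
  if (m.dailyTokens > 1.5 * DAILY_TOKEN_BUDGET) alerts.push("token_usage over 150% of budget");
  if (m.costPerTask > 2.0) alerts.push("cost_per_task above $2.00");
  return alerts;
}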

Distributed Tracing

// Every agent action gets a trace
{
  "trace_id": "abc-123",
  "session_id": "user-session-456",
  "spans": [
    {
      "name": "read_file",
      "duration_ms": 45,
      "tokens_in": 200,
      "tokens_out": 1500,
      "status": "ok"
    },
    {
      "name": "llm_decision",
      "duration_ms": 2100,
      "tokens_in": 3000,
      "tokens_out": 500,
      "model": "claude-sonnet-4",
      "status": "ok"
    },
    {
      "name": "edit_file",
      "duration_ms": 120,
      "status": "ok"
    }
  ],
  "total_cost": "$0.12",
  "success": true
}

Recommended Stack

  • Metrics: Prometheus + Grafana
  • Tracing: Jaeger / OpenTelemetry (see the sketch after this list)
  • Agent-specific: LangSmith, AgentOps, Langfuse
  • Logging: Structured JSON with consistent trace IDs
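
As a sketch of the OpenTelemetry option, the wrapper below puts each tool call in a span like the trace example above; only the @opentelemetry/api import is real, while the tracedToolCall helper is our own convention:

import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("agent");

// Hypothetical helper: wrap any tool call in a span.
async function tracedToolCall<T>(name: string, fn: () => Promise<T>): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    try {
      const result = await fn();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: String(err) });
      throw err; // surface the failure; the span still records it
    } finally {
      span.end();
    }
  });
}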

Safety Guardrails

Translate abstract policies into machine-verifiable controls:

guardrails:
  input_validation:
    - max_tokens: 10000
    - blocked_patterns: ["rm -rf /", "DROP TABLE", "sudo"]
    - content_filter: enabled

  output_validation:
    - no_pii_in_responses: true
    - max_actions_per_turn: 10
    - forbidden_tools_in_production: ["delete_database"]

  runtime_limits:
    - max_loop_iterations: 20
    - max_total_tokens: 500000
    - timeout_seconds: 300
    - max_cost_per_session: "$5.00"

  failure_modes:   # Microsoft's taxonomy
    - security_lapses:    → block + alert
    - hallucinated_tools: → validate tool exists
    - planning_loops:     → iteration limit + escalate
    - memory_poisoning:   → input sanitization
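
A YAML file only helps if the loop enforces it. Below is a minimal sketch of runtime-limit enforcement, assuming a hypothetical step() function that runs one agent iteration and reports its token and dollar spend:

// Limits mirroring the runtime_limits block above.
const LIMITS = {
  maxLoopIterations: 20,
  maxTotalTokens: 500_000,
  timeoutMs: 300_000,
  maxCostPerSession: 5.0, // USD
};

interface AgentStep { tokens: number; cost: number; done: boolean }

// `step` is a stand-in for one iteration of your agent loop.
async function runWithLimits(step: () => Promise<AgentStep>): Promise<void> {
  const deadline = Date.now() + LIMITS.timeoutMs;
  let tokens = 0, cost = 0;
  for (let i = 0; i < LIMITS.maxLoopIterations; i++) {
    if (Date.now() > deadline) throw new Error("timeout: escalate to human");
    const s = await step();
    tokens += s.tokens;
    cost += s.cost;
    if (tokens > LIMITS.maxTotalTokens) throw new Error("token limit exceeded");
    if (cost > LIMITS.maxCostPerSession) throw new Error("cost cap exceeded");
    if (s.done) return;
  }
  throw new Error("planning loop: iteration limit hit, escalating");
}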

Production Sandbox (production.ts)

import { query } from "@anthropic-ai/claude-agent-sdk";

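// NOTE: `metrics` and `alerting` below are assumed application-level helpers
// (thin wrappers over your observability clients), not exports of the SDK.
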
const response = query({
  prompt: "Run the deployment pipeline",
  options: {
    sandbox: {
      enabled: true,
      autoAllowBashIfSandboxed: true,
      excludedCommands: ["rm", "dd", "mkfs"],
      allowUnsandboxedCommands: false,
      network: {
        enabled: true,
        proxyUrl: "http://proxy.internal:8080"
      }
    },
    plugins: ["./plugins/deploy-validator"],
    settingSources: ["project"],
    permissionMode: "default",

    hooks: {
      SessionStart: async () => { metrics.startTimer(); },
      PreToolUse: async (input) => {
        metrics.trackToolCall(input.toolName);
        return { allow: true };
      },
      PostToolUse: async (input) => {
        if (input.result?.isError) metrics.trackError(input.toolName);
      },
      Error: async (input) => {
        alerting.send(`Agent error: ${input.error_code}`);
      },
      Stop: async () => {
        const report = metrics.getReport();
        console.log(`Duration: ${report.duration}ms`);
        console.log(`Tools: ${report.toolCalls}, Errors: ${report.errors}`);
      }
    }
  }
});

14 Known Issues to Watch

The SDK has 14 documented issues including: CLI not found, MCP type field missing (cryptic exit code 1), Unicode line separators breaking JSON, context overflow making sessions unrecoverable, and subagent cleanup. Each is a production lesson — review them all before deploying.

Deployment Strategies

Progressive Rollout

Stage 1: Shadow Mode (1 week)
  Agent runs in parallel, outputs logged but not used
  Compare with human baseline

Stage 2: Canary (1 week)
  5% of traffic → agent
  Monitor all metrics closely

Stage 3: Progressive (2 weeks)
  5% → 25% → 50% → 100%
  Automated rollback on metric degradation

Stage 4: Full Production
  100% with full observability
  Maintain rollback capability
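
One way to implement the traffic split is deterministic bucketing, sketched below: hashing the user ID keeps each user on the same side of the split as the percentage grows (in practice the rollout value lives in a config service, not a constant):

import { createHash } from "node:crypto";

// Current rollout percentage for the canary/progressive stages.
let rolloutPercent = 5;

// Deterministic bucket per user, so a user doesn't flip between agent and baseline.
function routeToAgent(userId: string): boolean {
  const digest = createHash("sha256").update(userId).digest();
  const bucket = digest.readUInt16BE(0) % 100; // 0..99
  return bucket < rolloutPercent;
}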

Rollback & Circuit Breakers

circuit_breaker:
  error_threshold: 10%
  window: 5_minutes
  action: fallback_to_human
  
  # Regularly test rollback
  drill_schedule: "monthly"
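
A minimal in-process sketch of that breaker, assuming each agent session reports success or failure; the 20-sample floor keeps a tiny window from tripping it (reset and drill logic are omitted):

const WINDOW_MS = 5 * 60_000;  // 5 minute window, per the config above
const ERROR_THRESHOLD = 0.10;  // 10% error rate trips the breaker

const outcomes: { at: number; error: boolean }[] = [];
let open = false;

function record(error: boolean): void {
  const now = Date.now();
  outcomes.push({ at: now, error });
  // Drop samples that have aged out of the window.
  while (outcomes.length && outcomes[0].at < now - WINDOW_MS) outcomes.shift();
  const errors = outcomes.filter((o) => o.error).length;
  if (outcomes.length >= 20 && errors / outcomes.length > ERROR_THRESHOLD) {
    open = true; // breaker trips: route new sessions to humans
  }
}

function shouldFallbackToHuman(): boolean {
  return open;
}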

Cost Management

Token costs are the #1 surprise in production agents:

  • Budget per task — Set hard limits (e.g., $5 max per session)
  • Model routing — Use fast/cheap models for exploration, capable models for decisions (see the sketch after this list)
  • Context pruning — Aggressively manage what's in the context window
  • Caching — Cache tool results, prompt prefixes, and common retrieval results
  • Token monitoring — Teams processing 100M+ daily requests save $50K+/month with proper monitoring
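
A sketch of that routing decision with a per-session budget check folded in; the model names are illustrative (claude-sonnet-4 echoes the trace example earlier) and pickModel is not an SDK API:

// Cheap model for exploration steps, capable model only when committing to an action.
type StepKind = "explore" | "decide";

const MODELS = {
  explore: "claude-3-5-haiku-latest", // fast/cheap; name is illustrative
  decide: "claude-sonnet-4",          // matches the trace example above
};

function pickModel(kind: StepKind, spentSoFar: number, budget = 5.0): string {
  if (spentSoFar >= budget) throw new Error("session budget exhausted");
  return MODELS[kind];
}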

Continuous Evaluation

# CI/CD pipeline for agents
agent_pipeline:
  on_prompt_change:
    - run: eval_suite.yaml
    - assert: success_rate >= 0.95
    - assert: avg_cost <= $0.50
    - assert: p95_latency <= 5s
    
  on_tool_change:
    - run: integration_tests
    - run: safety_tests
    
  weekly:
    - run: full_regression_suite
    - run: adversarial_tests
    - generate: performance_report
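
As a sketch of the success_rate gate, here is an N-trial runner that fails the pipeline below threshold; EvalCase is a stand-in for your eval harness's case type:

interface EvalCase { name: string; run: () => Promise<boolean> }

// Runs each case N times and enforces the success_rate >= 0.95 gate above.
async function evalGate(cases: EvalCase[], trials = 5, threshold = 0.95) {
  let passed = 0, total = 0;
  for (const c of cases) {
    for (let i = 0; i < trials; i++) {
      total++;
      if (await c.run().catch(() => false)) passed++;
    }
  }
  const rate = passed / total;
  console.log(`success_rate=${rate.toFixed(3)} over ${total} trials`);
  if (rate < threshold) process.exit(1); // non-zero exit fails CI
}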

Production Readiness Checklist

  • Observability: traces, metrics, alerts configured
  • Guardrails: input/output validation, runtime limits
  • HITL: approval workflows for high-risk actions
  • Cost controls: per-session budgets, model routing
  • Deployment: progressive rollout with rollback
  • Evaluation: CI/CD eval pipeline, regression suite
  • Documentation: runbook, incident response plan
  • Security: audit trail, permission tiers, sandboxing

Congratulations!

You've completed the full AgentWay curriculum. You now understand:

  • Agent fundamentals (P1-P3): basics, tools, prompts
  • Core systems (P4-P6): memory, evaluation, design patterns
  • Advanced topics (P7-P12): MCP, RAG, multi-agent, code gen, HITL, production

What's next?

Try It: Final Capstone — End-to-End Agent

Build a production-ready agent using the full SDK surface area.

  1. MCP tools + subagents for task decomposition
  2. Eval harness: N-trial report with success rate
  3. sandbox{} for safe execution
  4. All 12 hooks for observability and control
  5. Cost report with token usage breakdown
  6. Session management with rotation before context limit
  7. Public repo with README, tests, and demo

Gate: P12 Complete — L5 Expert Unlocked — Reproducible demo, eval report, cost report, sandbox verified.