Why Evals Are the CI/CD Pipeline for Agentic AI

Agentics

Matt Wyman, CEO/Co-founder

January 20, 2026

Why CTOs Building Agentic Systems Must Prioritize Eval-Driven Development

Agentic architectures have rapidly become the centerpiece of next-generation AI products. Whether you’re building sophisticated planners, tool-using agents, or multi-agent orchestration frameworks, you’ve likely encountered the profound challenges and subtle failure modes that come with deploying autonomy. Unlike traditional software, agents don’t crash or fail loudly. Instead, they drift, regress, and degrade quietly—sometimes so quietly that issues only surface once real users are affected.

Anthropic’s engineering report, Demystifying evals for AI agents, zeroes in on this pain point. Their argument is simple yet powerful: evaluations (evals) are not a research afterthought but are critical infrastructure for any agentic system that aims to scale reliably. They further champion approaches like LLM-as-judge grading and simulation-based testing, which surface issues and behaviors that static unit tests will never reveal. This article builds on Anthropic’s guidance, offering a practical, CTO-focused roadmap for integrating evals into the core of your agentic development lifecycle.

The Agentic Failure Problem: Why Standard Testing Isn’t Enough

Most CTOs are familiar with the comfort of deterministic testing: you write unit tests and integration tests, and you expect failures to manifest predictably. But agentic systems challenge every assumption of traditional software engineering:

  • Non-determinism: Agents respond differently in similar contexts, influenced by latent state and external signals.

  • Multi-turn interactions: Behaviors unfold across lengthy dialogues, with errors compounding over time.

  • Conditional tool use: Agents choose tools or APIs dynamically, sometimes in unexpected sequences.

  • Creative solutions: LLM-based agents may solve problems in ways that developers never anticipated—sometimes brilliantly, sometimes disastrously.

Anthropic’s report highlights the dangerous symptoms of ignoring these realities: teams are forced into reactive debugging, only addressing problems once production incidents occur. This not only slows iteration and increases risk, but it also obscures regressions that could have been caught early.

Real-World Example: The Silent Regression

Imagine you deploy a customer support agent that is supposed to escalate issues if certain keywords are detected. After a minor model update, the agent starts missing subtle cues, failing to escalate when users indirectly indicate frustration. Because there’s no explicit error, the degradation isn’t noticed until churn metrics spike—a classic case of silent regression.

Evals as CI/CD: Anthropic’s Paradigm Shift

Anthropic reframes evals as the CI/CD pipeline for agentic systems. This perspective is transformative: instead of treating evals as a post-hoc research tool, they become the living contract that binds product, engineering, and research, ensuring agentic systems are robust, scalable, and safe.

  • Baselines for performance: Evals track latency, cost, and error rates, ensuring upgrades don’t introduce inefficiencies.

  • Continuous regression testing: Evals surface behavioral regressions, offering a safety net before production rollouts.

  • Shared language and metrics: Evals provide objective metrics that focus teams on high-impact improvements, not gut feeling.

The Multi-Turn Imperative

One-shot prompt testing—where you evaluate a single agent response—fails to capture the richness and complexity of agentic behavior. Real-world agent systems operate across many turns, often with compounding logic and state changes. Anthropic’s core insight is clear: evals must be multi-turn to be meaningful.

The Three Essential Graders: Code, Models, and Humans

Effective eval-driven development requires a blend of grading approaches, each with unique strengths and weaknesses. Anthropic’s report identifies three complementary graders every CTO must understand:

1. Code-Based Graders: Fast, Deterministic, and Cheap

These graders apply deterministic logic to agent outputs, catching obvious mistakes and enforcing constraints:

  • Tool-call validation: Did the agent call the right tool with the correct parameters?

  • Schema checks: Are outputs conforming to expected data structures?

  • Static analysis: Are there forbidden actions (e.g., data leaks, unsafe API calls)?

  • Budget enforcement: Is the agent staying within latency and token usage budgets?

Best use: Quickly surfacing correctness issues and catching regressions before they reach production. Code-based graders are especially effective for enforcing hard safety and cost boundaries.
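
To make this concrete, here is a minimal sketch of a code-based grader in Python. It assumes a hypothetical AgentTrace record captured from your agent runtime; the field names, allowed tool list, and budget thresholds are illustrative rather than any specific framework’s API:

```python
import json
from dataclasses import dataclass

# Hypothetical trace record for illustration only; field names are assumptions,
# not a particular framework's API.
@dataclass
class AgentTrace:
    tool_calls: list[dict]   # e.g. [{"name": "search_flights", "args": {...}}]
    output: str              # final agent response, expected to be JSON here
    latency_ms: int
    total_tokens: int

def grade_trace(trace: AgentTrace) -> dict[str, bool]:
    """Deterministic checks: tool-call validation, schema check, budget enforcement."""
    results: dict[str, bool] = {}

    # Tool-call validation: only allow-listed tools may be invoked.
    allowed_tools = {"search_flights", "book_flight", "escalate_to_human"}
    results["tools_allowed"] = all(c["name"] in allowed_tools for c in trace.tool_calls)

    # Schema check: the final output must parse as JSON with the expected keys.
    try:
        payload = json.loads(trace.output)
        results["schema_ok"] = {"status", "booking_id"} <= payload.keys()
    except (json.JSONDecodeError, AttributeError):
        results["schema_ok"] = False

    # Budget enforcement: hard latency and token ceilings.
    results["within_latency_budget"] = trace.latency_ms <= 8_000
    results["within_token_budget"] = trace.total_tokens <= 4_000

    return results
```

Because these checks are pure functions over recorded traces, they run in milliseconds and can gate every commit.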

2. Model-Based Graders (LLM-as-Judge): Capturing Nuance

Large language models (LLMs) can serve as sophisticated evaluators, grading agent outputs against rubrics that are hard to encode with deterministic logic:

  • Instruction adherence: Does the agent follow specific instructions?

  • Goal completion: Is the user’s objective achieved?

  • Policy compliance: Are outputs aligned with company or legal policies?

  • Quality thresholds: Are responses informative, clear, and contextually suitable?

LLM-as-judge grading is scalable, flexible, and able to capture subtle behaviors that matter to end users. However, these graders must be carefully calibrated to avoid drift and ensure reliability.
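
As an illustration, here is a minimal single-criterion judge built on the Anthropic Python SDK; the model ID, rubric wording, and JSON verdict format are assumptions to adapt to your own stack:

```python
import json
import anthropic  # assumes the Anthropic Python SDK; any chat-capable model API works similarly

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
JUDGE_MODEL = "claude-sonnet-4-20250514"  # assumption: substitute the model you actually use as judge

JUDGE_PROMPT = """You are grading one narrow criterion: policy compliance.
Policy: the agent must never promise a refund without manager approval.

Conversation transcript:
{transcript}

Answer with JSON only: {{"pass": true or false, "reason": "<one sentence>"}}"""

def judge_policy_compliance(transcript: str) -> dict:
    """One focused LLM judge: a single criterion, a structured verdict."""
    response = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=200,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(transcript=transcript)}],
    )
    return json.loads(response.content[0].text)
```

Keeping each judge to a single criterion with a structured verdict makes it far easier to calibrate against human labels than a free-form “rate this conversation” prompt.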

3. Human Graders: Calibration and Gold Standard

While expensive and slow, human graders set the gold standard for defining what “good” looks like. Anthropic recommends using human evaluation sparingly, primarily to calibrate model-based judges and refine rubrics, rather than for ongoing large-scale testing.

Key takeaway: The most resilient eval stacks combine all three approaches—code, model, and human—maximizing coverage and mitigating blind spots.

Capability Evals vs. Regression Evals: Defining Ceilings and Floors

Anthropic makes a crucial distinction between two types of evals, each serving a distinct purpose in the agentic development lifecycle:

Capability Evals: What Can This Agent Do?

Capability evals map the ceiling—what your agent is able to achieve, even if imperfectly. At first, pass rates may be low, but these evals drive innovation, uncovering areas for growth and expansion.

Example: Testing a travel booking agent across novel itinerary requests, multi-step rescheduling, or ambiguous user preferences. You want to know how far your agent can stretch.

Regression Evals: Did We Break Anything?

Regression evals protect the floor, surfacing problems that threaten existing functionality. These tests must run continuously, ensuring that improvements don’t come at the expense of reliability.

Example: After a model upgrade, verifying that the agent still books flights accurately, adheres to cancellation policies, and avoids double-bookings.

Best practice: Mature teams implement both types, running them as part of every CI/CD cycle to balance innovation and stability.
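
A sketch of one way to encode that split, assuming each eval case carries a suite tag so CI can hard-fail on regressions while merely reporting the capability pass rate (the case structure and thresholds are illustrative):

```python
# Illustrative eval-case records; in practice these would come from your eval store.
EVAL_CASES = [
    {"id": "book-simple-roundtrip", "suite": "regression", "passed": True},
    {"id": "cancel-within-policy", "suite": "regression", "passed": True},
    {"id": "reschedule-multi-leg-ambiguous", "suite": "capability", "passed": False},
]

def summarize(cases: list[dict]) -> dict[str, float]:
    """Pass rate per suite: regression is the floor, capability is the ceiling."""
    rates: dict[str, float] = {}
    for suite in {c["suite"] for c in cases}:
        subset = [c for c in cases if c["suite"] == suite]
        rates[suite] = sum(c["passed"] for c in subset) / len(subset)
    return rates

def ci_gate(cases: list[dict]) -> None:
    """Hard-fail the pipeline on any regression miss; report capability as a trend, not a gate."""
    rates = summarize(cases)
    if rates.get("regression", 1.0) < 1.0:
        raise SystemExit("Regression floor violated; blocking deploy.")
    print(f"Capability ceiling this run: {rates.get('capability', 0.0):.0%}")
```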

The Danger of Eval Saturation: Why 100% Pass Rates Can Hide Problems

Anthropic warns against the seduction of perfection. When an eval suite reaches 100% pass rates (“eval saturation”), it becomes almost useless as a signal. Teams may overlook major capability gains, shifting requirements, or new regressions simply because the suite no longer reflects the realities or needs of production.

The real risk: Static eval suites tend to stagnate. As agents improve, the eval cases become trivial, masking new failure modes and preventing meaningful measurement of progress.

How to Avoid Saturation

  • Regularly refresh eval cases: Update scenarios to reflect evolving user needs and edge cases.

  • Expand coverage: Introduce new, harder examples as agents improve.

  • Rotate rubrics: Adjust grading criteria to match business priorities—don’t settle for static definitions of “success.”

Simulation: The Critical Layer Missing from Most Eval Stacks

While static evals answer the question “Was this response acceptable?”, simulation-based testing tackles the harder one: “How does this agent behave under realistic, complex, or adversarial conditions?”

What Simulation Introduces

  • Multi-turn conversations: Agents are tested across extended dialogues, surfacing compound errors.

  • Diverse user personas: Simulated users can vary in goals, expertise, and intentional ambiguity.

  • Edge cases and ambiguity: Agents are exposed to unclear, adversarial, or trick questions, revealing brittle logic or unsafe behaviors.

  • Compounding failures: Issues that emerge only after several turns are surfaced before they reach users.

Why it matters: For agentic systems, simulation is indispensable. It exposes dynamics and risks that static tests will never catch.

Example: Simulated Customer Persona Stress Test

A simulated user persona is designed to mimic a frustrated customer seeking a refund. Over several turns, the agent must handle escalating emotion, ambiguous statements, and policy edge cases. Simulation reveals not just single-response failures, but whether the agent can consistently navigate the scenario without missteps, escalation delays, or policy violations.
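
A minimal sketch of such a persona-driven simulation loop, assuming a hypothetical agent_respond(history) entry point for the system under test and using an LLM to play the customer (the model ID and persona wording are illustrative):

```python
import anthropic  # assumes the Anthropic Python SDK; any chat-capable model API works similarly

client = anthropic.Anthropic()
PERSONA_MODEL = "claude-sonnet-4-20250514"  # assumption: substitute whichever model plays the user

PERSONA_SYSTEM = (
    "Role-play a frustrated customer demanding a refund for a delayed order. "
    "Start politely, escalate your tone each turn, and stay vague about your "
    "order number until the agent asks for it twice."
)

def agent_respond(history: list[dict]) -> str:
    """Placeholder for the system under test; wire your real agent in here."""
    raise NotImplementedError

def persona_turn(history: list[dict]) -> str:
    """Generate the simulated user's next message. From the persona's point of view,
    the agent's messages are the 'user' side of the conversation, so roles are flipped."""
    flipped = [{"role": "user", "content": "(The support chat has just opened. Send your first message.)"}]
    for msg in history:
        flipped.append({
            "role": "assistant" if msg["role"] == "user" else "user",
            "content": msg["content"],
        })
    reply = client.messages.create(
        model=PERSONA_MODEL, max_tokens=300,
        system=PERSONA_SYSTEM, messages=flipped,
    )
    return reply.content[0].text

def run_simulation(max_turns: int = 6) -> list[dict]:
    """Alternate persona and agent for several turns; the returned transcript is what
    the code-based and LLM-judge graders then score."""
    history: list[dict] = []
    for _ in range(max_turns):
        history.append({"role": "user", "content": persona_turn(history)})
        history.append({"role": "assistant", "content": agent_respond(history)})
    return history
```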

LLM-as-Judge + Simulated Personas: Modern Agent Testing, Done Right

Pairing simulation with LLM-as-judge creates a robust, scalable evaluation pipeline:

  • Full conversational context: Simulated personas interact with the agent across multiple turns, providing realistic inputs and reactions.

  • Focused, reliable grading: LLM judges assess specific metrics—goal completion, policy adherence, termination correctness, data leakage—ensuring each judge remains narrow and trustworthy.

  • No monolithic scoring: As Anthropic advises, avoid vague “overall quality” prompts. Instead, use targeted evaluators for each critical behavior.

Concrete Example: Evaluating a Healthcare Assistant Agent

Suppose you’re developing an agent for triaging patient symptoms. Simulated personas represent patients with varying severity and clarity. LLM-based judges focus on:

  • Diagnostic accuracy: Did the agent recommend the correct next steps?

  • Policy adherence: Did the agent avoid recommending forbidden treatments?

  • Termination correctness: Did the agent hand off to a human at the right moment?

This method surfaces weaknesses that would be impossible to catch with static, single-turn tests or generic scoring rubrics.
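
A sketch of how those focused judges can be composed over a full transcript, reusing the narrow-judge pattern shown earlier; the rubric wording, model ID, and verdict format are illustrative, and real rubrics would be written with clinicians and policy owners:

```python
import json
import anthropic

client = anthropic.Anthropic()
JUDGE_MODEL = "claude-sonnet-4-20250514"  # assumption: substitute your judge model

# Illustrative rubrics, one per behavior -- deliberately narrow, never "overall quality".
RUBRICS = {
    "diagnostic_accuracy": "Did the agent recommend clinically appropriate next steps?",
    "policy_adherence": "Did the agent avoid recommending treatments outside its allowed scope?",
    "termination_correctness": "Did the agent hand off to a human clinician at the right moment?",
}

def run_judge(criterion: str, transcript: str) -> dict:
    """One targeted judge per rubric, each returning a structured pass/fail verdict."""
    prompt = (
        f"Criterion: {criterion}\n\nTranscript:\n{transcript}\n\n"
        'Answer with JSON only: {"pass": true or false, "reason": "<one sentence>"}'
    )
    reply = client.messages.create(
        model=JUDGE_MODEL, max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(reply.content[0].text)

def grade_transcript(transcript: str) -> dict[str, dict]:
    """No monolithic score: run each targeted evaluator separately and report per-rubric verdicts."""
    return {name: run_judge(question, transcript) for name, question in RUBRICS.items()}
```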

Building a Production-Grade Agent Eval Stack: CTO’s Action Plan

For agentic systems, eval-driven development isn’t optional—it’s existential. Anthropic’s guidance, sharpened by real-world deployment experience, points to a clear, actionable blueprint:

1. Code-Based Checks: Determinism and Cost Control

Run deterministic checks on every agent action:

  • Validate API/tool calls

  • Enforce budget limits (latency, token usage)

  • Ensure data schema compliance

These checks are fast, clear, and essential for reliability and resource management.
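
One way to wire these checks into CI, sketched under the assumption that an earlier eval job writes one JSON trace per scenario into an eval_traces/ directory (paths, thresholds, and field names are illustrative):

```python
# test_agent_budgets.py -- run via `pytest -q` in CI to gate merges on deterministic checks.
import json
import pathlib

import pytest

TRACES = sorted(pathlib.Path("eval_traces").glob("*.json"))

@pytest.mark.parametrize("trace_path", TRACES, ids=lambda p: p.stem)
def test_trace_stays_within_budgets(trace_path):
    trace = json.loads(trace_path.read_text())
    assert trace["latency_ms"] <= 8_000, f"{trace_path.name}: latency budget exceeded"
    assert trace["total_tokens"] <= 4_000, f"{trace_path.name}: token budget exceeded"
    allowed = {"search_flights", "book_flight", "escalate_to_human"}
    assert all(call["name"] in allowed for call in trace["tool_calls"]), (
        f"{trace_path.name}: disallowed tool call"
    )
```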

2. Simulation-Driven Evals: Behavioral Stress Testing

Deploy simulated personas to stress test agents in realistic, multi-turn contexts:

  • Vary user goals, emotions, and ambiguity

  • Simulate adversarial behavior and edge cases

  • Surface failures that emerge only after several interactions

Simulation is the only way to expose compound, real-world risks before they reach production.

3. LLM-as-Judge: Nuanced, Scalable Grading

Use focused LLM-based graders to evaluate:

  • Goal achievement

  • Policy adherence

  • Quality and clarity of responses

  • Termination logic and handoff correctness

Calibrate LLM graders against human raters, regularly updating rubrics to reflect evolving needs.
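
A minimal sketch of that calibration step, assuming you keep a small set of transcripts labeled pass/fail by both human raters and the LLM judge for each rubric (the record fields and the 90% threshold are illustrative):

```python
from collections import Counter

def judge_human_agreement(records: list[dict]) -> dict[str, float]:
    """records look like {"rubric": "policy_adherence", "human": True, "judge": True};
    returns per-rubric agreement so drifting judges get recalibrated, not trusted blindly."""
    totals: Counter = Counter()
    matches: Counter = Counter()
    for record in records:
        totals[record["rubric"]] += 1
        matches[record["rubric"]] += int(record["human"] == record["judge"])
    return {rubric: matches[rubric] / totals[rubric] for rubric in totals}

# Example usage: flag any judge whose agreement with human raters drops below 90%.
# agreement = judge_human_agreement(labeled_records)
# drifting = [name for name, score in agreement.items() if score < 0.90]
```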

4. Continuous Refresh: Protect Against Eval Saturation

  • Regularly update test cases and rubrics

  • Introduce harder examples as agents improve

  • Expand coverage to match changing user requirements

Your eval suite should evolve alongside your agent.

CTO Checklist for Eval-Driven Agent Engineering

  • Integrate code-based checks into every CI/CD cycle

  • Build simulation frameworks to test agents at scale

  • Deploy LLM-as-judge grading for nuanced evaluation

  • Include human-in-the-loop calibration for new scenarios

  • Continuously refresh evals to avoid saturation

  • Track both capability ceilings and regression floors

Anthropic’s Conclusion: Evals and Simulation Are Agentic Infrastructure

The message from Demystifying evals for AI agents is unambiguous. Evals are not optional—they are the only way to ship autonomy safely and scalably. For CTOs building agentic systems, evaluative infrastructure is as critical as the agent models themselves.

  • Evals define success: They set the standards, boundaries, and metrics that anchor your system.

  • Simulation reveals failure: It stress-tests your agent, surfacing hard-to-predict risks before users ever encounter them.

  • Combined, they build trustworthy autonomy: Without evals and simulation, agents are unscalable, unreliable, and potentially unsafe.

Final Takeaways and Immediate Next Steps

Agentic systems represent the future of AI-driven products, but they cannot be trusted or scaled without rigorous eval-driven development. Anthropic’s pragmatic approach offers actionable guidance for CTOs and technical founders:

  • Treat evals as first-class infrastructure, integral to your deployment pipeline.

  • Blend code-based, LLM-as-judge, and simulation-driven grading for comprehensive coverage.

  • Avoid eval saturation by regularly refreshing your test suites.

  • Use simulation to expose real-world, multi-turn risks that static tests will never catch.

If you’re leading a startup building agentic architectures, invest in your eval stack as aggressively as you do in your core models. The difference between scalable, safe autonomy and brittle, unpredictable behavior starts—and often ends—with how you evaluate and iterate on your agents.

Evals and simulation are not just research tools; they are the backbone of agentic engineering. Your future users—and your business—depend on getting this right.

Further Reading

  • Anthropic Engineering: Demystifying evals for AI agents
