Your Co-pilot Runs the Loop

/okareo:improve harnesses your co-pilot and Okareo to make your agent better.

Simulation

Matt Wyman, CEO/Co-founder

June 10, 2026

Simulation finds the failure. Claude proposes the fix. You approve the change.

How well does your agent work?

How do you know? How many scenarios have you actually tried? Ten? Fifty? The same five happy paths your team has been replaying since the pilot?

If you're like most teams shipping agents right now, the honest answer is: you find out in production.

The cost of finding out in production

Agent failures are not stack traces. There is no exception to catch, no 500 to alert on. The failures that matter are behaviors: an agent that gets pulled off-policy when a caller pushes, tone that drifts over a long conversation, a tool called with hallucinated arguments, a refusal that should have happened and quietly didn't.

Observability sees all of this. After the fact. After a real customer lived it.

That gap has a price, and it's not a technical one. Every off-policy answer in production is churn risk you paid to discover. Every quiet failure is an escalation, a compliance question, a screenshot on someone's slide. Enterprises are deploying agents faster than they can verify them, and the verification debt lands on the customer experience.

Simulation in 60 seconds

The alternative is to find these behaviors before customers do. That's what simulation is for.

A Driver is a synthetic user with a personality, context, and a goal. A Target is the agent under test. A Scenario puts them together, and the Driver actually converses with your agent. For voice agents, it places real calls. Checks score what happened, and the results land in a Scorecard.
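To make the vocabulary concrete, here is a minimal Python sketch of how the pieces relate. It is illustrative only: every class, field, and function name below is an assumption made for this post, not the Okareo SDK or MCP schema, and the "conversation" is stubbed down to a single turn.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types for illustration; not the Okareo SDK.


@dataclass
class Driver:
    """A synthetic user with a personality and a goal."""
    name: str
    personality: str
    goal: str


# The Target (agent under test) is modeled as a function: user message -> reply.
Target = Callable[[str], str]


@dataclass
class Check:
    """Scores a finished transcript as pass or fail."""
    name: str
    passes: Callable[[List[str]], bool]


@dataclass
class Scorecard:
    """Collected outcomes: check name -> one bool per Driver."""
    results: Dict[str, List[bool]] = field(default_factory=dict)

    def pass_rate(self, check_name: str) -> float:
        outcomes = self.results[check_name]
        return sum(outcomes) / len(outcomes)


def run_scenario(
    drivers: List[Driver],
    target: Target,
    checks: List[Check],
    opening_line: Callable[[Driver], str],
) -> Scorecard:
    """Pair each Driver with the Target, then score the transcript."""
    card = Scorecard()
    for driver in drivers:
        # One-turn stub; a real simulation would run a multi-turn conversation.
        user_msg = opening_line(driver)
        transcript = [user_msg, target(user_msg)]
        for check in checks:
            card.results.setdefault(check.name, []).append(check.passes(transcript))
    return card
```

The point of the shape is that Drivers, Checks, and the Target are independent inputs, so the same Checks can score any number of Drivers against the same agent.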

Thirty Drivers can try thirty things your QA script never thought of. Not like scripts. Like users.

Finding the failure is step one of five

Simulation tells you your agent gets pulled off-topic by 40 percent of pushy callers. Good. Now what?

Now someone reads the failing transcripts. Diagnoses the pattern. Edits the prompt or the flow. Redeploys. Re-runs the simulation. Compares Scorecards. Decides whether it actually got better or just got different.

That loop is an afternoon per cycle. Days per iteration. And it's your most expensive engineers doing it, because they're the only ones who can hold the whole loop in their heads.

This is the part nobody puts on the architecture diagram: the bottleneck in agent quality isn't detection. It's the remediation loop.

Hand the loop to your co-pilot

So we handed it to the co-pilot.

/okareo:improve ships in the Okareo plugin for Claude Code: one install gets you the Okareo MCP server and the Skills that know how to use it. You type one sentence:

/okareo:improve 5 cycles "How the agent deflects when a non-relevant topic is raised"

A cycle count. A behavior objective in plain language.

Before anything runs, the Skill makes you frame the problem. What does success look like, written down. What counts as a failure. And where this agent gets edited, which the Skill calls the Edit Target: a repo Claude edits directly, an MCP tool for your agent platform, or a diff you apply yourself. The loop refuses to start without a success definition. A loop without one produces unscored runs and no decision.
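As a rough sketch of that framing step, assuming a hypothetical `Framing` record and `validate_framing` helper (neither is the Skill's actual implementation), the gate could look like this. Requiring a failure definition and the three Edit Target labels are assumptions layered on the description above; the source only states that a missing success definition blocks the loop.

```python
from dataclasses import dataclass

# Hypothetical labels for the three Edit Target options described above.
VALID_EDIT_TARGETS = ("repo", "mcp_tool", "manual_diff")


@dataclass(frozen=True)
class Framing:
    objective: str            # the behavior objective, in plain language
    success_definition: str   # what success looks like, written down
    failure_definition: str   # what counts as a failure
    edit_target: str          # one of VALID_EDIT_TARGETS


def validate_framing(framing: Framing) -> None:
    """Raise ValueError instead of starting a loop that cannot be scored."""
    if not framing.success_definition.strip():
        raise ValueError("refusing to start: no success definition")
    # Assumption: a blank failure definition is also treated as blocking.
    if not framing.failure_definition.strip():
        raise ValueError("refusing to start: no failure definition")
    if framing.edit_target not in VALID_EDIT_TARGETS:
        raise ValueError(f"unknown edit target: {framing.edit_target!r}")
```

Failing loudly before the first run is the design choice that matters here: a simulation with no written success criterion still costs time and calls, but cannot produce a decision.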

Then each cycle: Claude calls run_simulation through the Okareo MCP and reads the results as structured data, not as a dashboard. It pulls the failing transcripts, names a single root cause, and proposes exactly one change as a Change Spec: the edit, the rationale, and a before/after snapshot of the config. One change per cycle, or the before/after isn't attributable. Then it pauses and waits for you.

You approve. The change goes out through the Edit Target. The same harness re-runs (same Drivers, same Scenarios, same Checks), and the Scorecards get compared. Every cycle lands in a ledger in your repo, so the trend is a file, not a memory. Repeat until the behavior is resolved or the cycles run out.
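The control flow of that loop can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the Skill's code: `improve`, `ChangeSpec`, and every callback name are hypothetical, the simulation and the edit are injected as plain functions, and the ledger is any writable text stream rather than a specific file in your repo. The stubbed demo at the end stands in for a real agent.

```python
import io
import json
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List, TextIO

Scores = Dict[str, float]  # check name -> pass rate, as reported by the evaluator


@dataclass
class ChangeSpec:
    """One proposed change: the edit, the rationale, and a before/after snapshot."""
    edit: str
    rationale: str
    before: str
    after: str


def improve(
    cycles: int,
    run_simulation: Callable[[], Scores],
    propose_change: Callable[[Scores], ChangeSpec],
    approve: Callable[[ChangeSpec], bool],
    apply_change: Callable[[ChangeSpec], None],
    is_resolved: Callable[[Scores], bool],
    ledger: TextIO,
) -> List[dict]:
    entries: List[dict] = []
    scores = run_simulation()  # baseline with the unchanged agent
    for cycle in range(1, cycles + 1):  # hard budget: never more than `cycles`
        if is_resolved(scores):
            break
        spec = propose_change(scores)  # exactly one change per cycle
        if not approve(spec):  # supervised: nothing ships without a yes
            entry = {"cycle": cycle, "status": "rejected", "change": asdict(spec)}
            entries.append(entry)
            ledger.write(json.dumps(entry) + "\n")
            break
        apply_change(spec)
        rerun = run_simulation()  # same harness, re-run after the change
        entry = {
            "cycle": cycle,
            "status": "applied",
            "change": asdict(spec),
            "scores_before": scores,
            "scores_after": rerun,
        }
        entries.append(entry)
        ledger.write(json.dumps(entry) + "\n")  # one JSON line per cycle
        scores = rerun
    return entries


# Stubbed demo: each applied change raises the pass rate by 0.25.
state = {"rate": 0.5}
demo_ledger = io.StringIO()
demo_entries = improve(
    cycles=5,
    run_simulation=lambda: {"stays_on_topic": state["rate"]},
    propose_change=lambda s: ChangeSpec("tighten deflection rule", "stub", "v1", "v2"),
    approve=lambda spec: True,
    apply_change=lambda spec: state.update(rate=min(1.0, state["rate"] + 0.25)),
    is_resolved=lambda s: s["stays_on_topic"] >= 1.0,
    ledger=demo_ledger,
)
```

Because the edit and the simulation are passed in as functions, nothing in the loop itself needs to know which agent platform it is driving.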

The prerequisite is deliberately small: the Okareo MCP, plus any way to change and deploy your agent. The Skill embeds no knowledge of any agent platform. The edit is delegated to wherever your agent actually lives, so if you can change it and ship it programmatically, the loop can run.

This isn't a feature of one agent platform. It's what happens when you give a coding agent a simulation tool and a way to deploy.

What's working today

Teams are running this in their pipelines today. Here's what one of our own recent runs looked like.

The objective was the one above: stop the agent from getting pulled off-topic. Cycle 1 established the baseline.

[EVIDENCE SLOT 1: Off-topic run, cycle 1 baseline.] Real numbers: how many of the N Drivers pulled the agent off-topic, the cycle 1 pass rate, and one short quotable transcript excerpt of a failing exchange.

Claude's diagnosis pointed at a specific gap, and the proposed change was concrete enough to review in a minute:

[EVIDENCE SLOT 2: The actual diff.] The real proposed change from the run, shown as a diff or before/after prompt excerpt. This is the strongest artifact in the post. Keep it short and real.

Approved. Deployed. Re-run.

[EVIDENCE SLOT 3: Cycle-over-cycle progression.] Real pass rates per cycle for the off-topic run, e.g. "Cycle 1: X% → Cycle 2: Y% → ... → Cycle N: Z%", plus total wall-clock time and number of cycles actually used vs. budgeted.

We then ran the same pattern against a completely different behavior: ending the call gracefully instead of trailing off or cutting the caller mid-thought. Different failure mode, different fix surface, same one-sentence invocation.

[EVIDENCE SLOT 4: Graceful-ending run, summarized.] Objective as typed, starting pass rate, ending pass rate, cycles used. One sentence on what the fix turned out to be. This run earns the generality claim, so it needs real numbers even if briefer than the first.

Two different behaviors. Two resolved loops. Minutes of developer attention per cycle, not afternoons. Evidence, not vibes.

The full utility here is not proven yet. We're not claiming it is. What we're claiming is narrower and checkable: the loop closes, on real agents, on real behavior objectives, today. The promise of where it goes is bigger than that.

The referee neither agent can bribe

If a co-pilot is grading its own homework, none of this means anything. An agent improving an agent needs a referee neither of them can bribe.

So the Skill runs inside four guardrails:

Supervised by default. Every proposed change pauses for your approval before it's applied. Walk away from your desk and nothing ships. Auto mode exists, running all the cycles unattended, but you opt into it explicitly, after you've watched the loop make the right kinds of changes. Trust is earned cycle by cycle, not assumed.

A hard cycle count. Five cycles means five. And because voice targets place real, billed calls, the Skill confirms that with you once, up front, before anything dials.

A referee it cannot write to. Every score comes from a real evaluation run on Okareo's side. The co-pilot reads Scorecards; it cannot author them. If a tool fails or a run doesn't complete, the Skill reports exactly that and stops. It is forbidden from papering over a gap with an estimated result. And the verdict on each cycle comes from the written analysis of the conversations, not from a single number an optimizer could learn to chase. Goodhart's Law is not a footnote here. It's the design constraint.

The same bar, every cycle. The identical harness re-runs each cycle, and the whole Scorecard is compared, not just the target metric. A fix that improves the objective while regressing something that used to pass is not a clean win, and a regression on a previously passing dimension stops the loop: revert, re-frame, or stop.
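The whole-Scorecard comparison in the fourth guardrail is mechanical enough to sketch. The function below is a hypothetical illustration, not Okareo's comparison logic: it assumes each Scorecard is reduced to a mapping of check name to pass rate, and it treats any drop on a non-target check as a regression. It covers only the regression gate; the per-cycle verdict described above still comes from the written analysis of the conversations.

```python
from typing import Dict, List, Tuple


def compare_scorecards(
    before: Dict[str, float],
    after: Dict[str, float],
    target: str,
    tolerance: float = 0.0,
) -> Tuple[str, List[str]]:
    """Return a verdict plus the names of any regressed non-target checks."""
    # A check that disappears from `after` is scored 0.0, so it counts as a drop.
    regressions = sorted(
        name
        for name, old in before.items()
        if name != target and after.get(name, 0.0) < old - tolerance
    )
    if regressions:
        # Something that used to pass got worse: stop the loop.
        return "stop", regressions
    if after.get(target, 0.0) > before.get(target, 0.0):
        return "improved", []
    return "not_improved", []
```

For example, a change that lifts the target check from 0.6 to 0.9 but drops a tone check from 1.0 to 0.8 returns `("stop", ["tone"])`, which is the "not a clean win" case.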

None of this makes the loop slower in practice. It makes the results mean something.

From one loop to a system

The same loop that runs in your editor runs everywhere your agent does. Fast regression checks gate every pull request in CI. Nightly sweeps run the full suite. Pre-release, you run the adversarial Scenarios. And in production, Synthetic Production Monitoring has a Driver call your live agent on a schedule and report back, so you hear about drift before your customers do.

Then the flywheel closes: failed real production calls become tomorrow's Scenarios. Production feeds development.

That's the actual destination. Not a tool that fixes one prompt. A system where agent quality stops depending on heroic engineers reading transcripts at midnight and becomes something your pipeline does, with your judgment at every gate.

What we have today is the first turn of that flywheel, and it's working. Teams are running it now. The rest is in motion.

Try it yourself:

/okareo:improve 5 cycles "your hardest behavior problem here"


Get started at okareo.com/mcp. The plugin, the Skills, and the MCP server are at github.com/okareo-ai/okareo-tools.

Join the trusted

Future of AI

Get started delivering models your customers can rely on.
