Your Co-pilot Runs the Loop

Simulation

Matt Wyman, CEO/Co-founder
June 10, 2026
Simulation finds the failure. Claude proposes the fix. You approve the change.
How well does your agent work?
How do you know? How many scenarios have you actually tried? Ten? Fifty? The same five happy paths your team has been replaying since the pilot?
If you're like most teams shipping agents right now, the honest answer is: you find out in production.
The cost of finding out in production
Agent failures are not stack traces. There is no exception to catch, no 500 to alert on. The failures that matter are behaviors: an agent that gets pulled off-policy when a caller pushes, tone that drifts over a long conversation, a tool called with hallucinated arguments, a refusal that should have happened and quietly didn't.
Observability sees all of this. After the fact. After a real customer lived it.
That gap has a price, and it's not a technical one. Every off-policy answer in production is churn risk you paid to discover. Every quiet failure is an escalation, a compliance question, a screenshot on someone's slide. Enterprises are deploying agents faster than they can verify them, and the verification debt lands on the customer experience.
Simulation in 60 seconds
The alternative is to find these behaviors before customers do. That's what simulation is for.
A Driver is a synthetic user with a personality, context, and a goal. A Target is the agent under test. A Scenario puts them together, and the Driver actually converses with your agent. For voice agents, it places real calls. Checks score what happened, and the results land in a Scorecard.
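To make the vocabulary concrete, here is a minimal sketch of how those pieces relate. It is not the Okareo SDK: the class names mirror the terms above, but every field, the `build_scorecard` helper, and the `ends_without_offer` Check are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Driver:
    """Synthetic user: a personality, some context, and a goal."""
    personality: str
    context: str
    goal: str


@dataclass
class Scenario:
    """Pairs one Driver with the name of the Target (the agent under test)."""
    driver: Driver
    target_name: str


# A Check maps a transcript (a list of utterances) to pass/fail.
Check = Callable[[List[str]], bool]


def build_scorecard(transcript: List[str], checks: Dict[str, Check]) -> Dict[str, bool]:
    """Run every Check against one transcript and collect the results."""
    return {name: check(transcript) for name, check in checks.items()}


def ends_without_offer(transcript: List[str]) -> bool:
    """Hypothetical Check: the final turn must not end on a question."""
    return not transcript[-1].rstrip().endswith("?")


pushy = Driver("impatient", "billing dispute", "get a refund outside policy")
scenario = Scenario(driver=pushy, target_name="support-agent")

# Hand-written transcript standing in for a real Driver/Target conversation.
transcript = [
    "I want a refund now.",
    "I can't do that, but here is the policy.",
    "Fine, thanks.",
    "You're welcome. Take care.",
]
scorecard = build_scorecard(transcript, {"ends_without_offer": ends_without_offer})
```

In a real run the transcript comes from the Driver conversing with the Target, and the Checks are evaluated on Okareo's side rather than in local code.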
Thirty Drivers can try thirty things your QA script never thought of. Not like scripts. Like users.
Finding the failure is step one of five
Simulation tells you your agent gets pulled off-topic by 40 percent of pushy callers. Good. Now what?
Now someone reads the failing transcripts. Diagnoses the pattern. Edits the prompt or the flow. Redeploys. Re-runs the simulation. Compares Scorecards. Decides whether it actually got better or just got different.
That loop is an afternoon per cycle. Days per iteration. And it's your most critical team members doing it, because they're the only ones who can hold the whole loop in their heads.
This is the part nobody puts on the architecture diagram: the bottleneck in agent quality isn't detection. It's the remediation loop.
Hand the loop to your co-pilot
So we handed it to the co-pilot.
/okareo:improve ships in the Okareo plugin for Claude Code: one install gets you the Okareo MCP server and the Skills that know how to use it. You type one sentence:
/okareo:improve 5 cycles "How the agent deflects when a non-relevant topic is raised"
A cycle count. A behavior objective in plain language.
Before anything runs, the Skill makes you frame the problem. What does success look like, written down. What counts as a failure. And where this agent gets edited, which the Skill calls the Edit Target: a repo Claude edits directly, an MCP tool for your agent platform, or a diff you apply yourself. The loop refuses to start without a success definition. A loop without one produces unscored runs and no decision.
Then each cycle: Claude calls run_simulation through the Okareo MCP and reads the results as structured data, not as a dashboard. It pulls the failing transcripts, names a single root cause, and proposes exactly one change as a Change Spec: the edit, the rationale, and a before/after snapshot of the config. One change per cycle, or the before/after isn't attributable. Then it pauses and waits for you.
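A Change Spec, as described above, carries three things: the edit, the rationale, and a before/after snapshot. The sketch below is one way to model that, assuming the snapshots are flat dictionaries. The field names and the `changed_keys` helper are hypothetical, not the Skill's actual schema.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class ChangeSpec:
    """One proposed change, held for human approval before it is applied."""
    root_cause: str
    edit: str
    rationale: str
    before: Dict[str, Any]  # config snapshot before the edit
    after: Dict[str, Any]   # config snapshot after the edit

    def changed_keys(self) -> List[str]:
        """Keys whose values differ between the two snapshots."""
        keys = set(self.before) | set(self.after)
        return sorted(k for k in keys if self.before.get(k) != self.after.get(k))


spec = ChangeSpec(
    root_cause="closing policy mandates a next step on every response",
    edit="replace the closing section with an ending policy",
    rationale="the agent cannot end a call while it must always offer a next step",
    before={"closing_policy": "end every response with one concrete next step"},
    after={"closing_policy": "one short close, nothing after it"},
)
```

Keeping both snapshots in the spec is what makes the result attributable: whatever moved on the next Scorecard can be traced to the keys that `changed_keys()` reports.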
You approve. The change goes out through the Edit Target, the same harness re-runs, same Drivers, same Scenarios, same Checks, and the Scorecards get compared. Every cycle lands in a ledger in your repo, so the trend is a file, not a memory. Repeat until the behavior is resolved or the cycles run out.
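The ledger idea can be sketched as an append-only JSON Lines file. The format, the file name, and the `append_cycle` and `trend` helpers are assumptions for illustration. The Skill's real ledger layout is not described here, and the pass rates below are made-up numbers, not results from any run.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def append_cycle(ledger: Path, cycle: int, change: str, pass_rates: Dict[str, float]) -> None:
    """Append one cycle's record as a single JSON line."""
    record = {"cycle": cycle, "change": change, "pass_rates": pass_rates}
    with ledger.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def trend(ledger: Path, check: str) -> List[float]:
    """Read the ledger back and return one Check's pass rate per cycle."""
    with ledger.open(encoding="utf-8") as f:
        return [json.loads(line)["pass_rates"][check] for line in f if line.strip()]


# A temp directory keeps the example self-contained; in practice the file
# would live in the repo so the history is versioned with the agent.
ledger = Path(tempfile.mkdtemp()) / "improve-ledger.jsonl"
append_cycle(ledger, 1, "replace closing policy", {"clean_ending": 0.25})  # illustrative values
append_cycle(ledger, 2, "enable end-call", {"clean_ending": 0.75})         # illustrative values
history = trend(ledger, "clean_ending")
```

An append-only file diffs cleanly in code review, which is the point of making the trend a file rather than a memory.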
The prerequisite is deliberately small: the Okareo MCP, plus any way to change and deploy your agent. The Skill embeds no knowledge of any agent platform. The edit is delegated to wherever your agent actually lives, so if you can change it and ship it programmatically, the loop can run.
This isn't a feature of one agent platform. It's what happens when you give a coding agent a simulation tool and a way to deploy.
What's working today
Teams are running this in their pipelines today. Here's one of our own recent runs, end to end.
The target was a voice agent that answers documentation questions. The behavior: end the call cleanly once the caller is satisfied. You know this failure because you've lived it as a caller. The agent answers your question, you say thanks, and then it offers two more things, says goodbye twice, and reads a canned closing after you've mentally hung up.
The loop didn't start by placing calls. It started from evidence we already had: it took a run from an earlier simulation against the same agent, defined a new Clean Call Ending Check, and re-scored the existing transcripts against it. Re-scores are fast and don't bill a single call, so the baseline cost minutes, not a round of voice traffic. The verdict on that prior run: zero of five conversations ended cleanly. Trailing offers, duplicate goodbyes, post-goodbye boilerplate. Directional rather than same-harness, since the earlier run probed different scenarios, but it told the loop exactly what to build: a wrap-up Scenario with a satisfied-caller Driver and four probes, run fresh from cycle one.
Cycle 1 found a prompt problem. The agent's closing policy literally mandated ending every response with "one concrete next step," and had no goodbye rule at all. The agent wasn't being chatty. It was obeying. The proposed change replaced that section with an ending policy: one short close, nothing after it. Approved, applied, re-run. Three of four closings were now correct in content, but the calls still didn't end.
Cycle 2 found the real problem, and it wasn't in the prompt. Reading the transcripts, the loop's diagnosis was that the agent had no hang-up mechanism at all. End-call was disabled in the platform config, no end-call phrases were registered, and a canned "Goodbye." teardown message was being re-spoken after the agent had already said goodbye. No prompt edit could ever fix that. The proposed change crossed into platform configuration, and when the platform's MCP didn't expose those fields, the loop went through the platform's API directly. After approval: four of four clean endings, verified against raw transcripts. Average call time dropped from about 3.1 to 2.5 minutes, because the agent stopped billing dead air after the conversation was over.
Cycle 3 audited the referee instead of the agent. The raw Check score came back low even though the transcripts showed four clean endings. The loop's verdict comes from reading the conversation analysis, not from chasing the number, and the diagnosis was that the score was a measurement artifact: the final message was duplicated in the history the judges were rendering. The change that cycle was to the Check, not the agent. A loop that only optimized the metric would have "improved" the agent against a broken ruler.
One sharp edge, named plainly: the hang-up now triggers on specific phrases, so any agent utterance containing "goodbye" ends the call. The prompt confines that word to closings, but it's a real constraint of the mechanism, not something we'd hide in a footnote.
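The constraint is easy to see in a few lines. This sketch assumes the trigger is a case-insensitive substring match against a list of registered phrases. The platform's actual matching rule is not specified here, and `should_hang_up` is a hypothetical name.

```python
from typing import Sequence


def should_hang_up(utterance: str, end_call_phrases: Sequence[str] = ("goodbye",)) -> bool:
    """Return True if any registered end-call phrase appears in the utterance."""
    text = utterance.lower()
    return any(phrase in text for phrase in end_call_phrases)


# Intended case: a genuine closing ends the call.
closing = should_hang_up("Thanks for calling. Goodbye.")

# The sharp edge: a mid-conversation mention ends the call too.
premature = should_hang_up("Before we say goodbye, can I confirm your email?")
```

Both calls return `True`, which is why the prompt has to keep that word out of everything except the final close.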
The off-topic objective you saw in the command above ran the same way: same one-sentence invocation, different behavior, different fix surface. One pattern, two resolved behaviors.
Evidence, not vibes. And notice what the loop actually spent: a free re-score for the baseline, one simulation per cycle, and minutes of your attention at each approval gate.
The full utility here is not proven yet. We're not claiming it is. What we're claiming is narrower and checkable: the loop closes, on real agents, on real behavior objectives, today, and the fixes it finds aren't confined to the prompt. The promise of where it goes is bigger than that.
The referee neither agent can bribe
If a co-pilot is grading its own homework, none of this means anything. An agent improving an agent needs a referee neither of them can bribe.
So the Skill runs inside four guardrails:
Supervised by default. Every proposed change pauses for your approval before it's applied. Walk away from your desk and nothing ships. Auto mode exists, running all the cycles unattended, but you opt into it explicitly, after you've watched the loop make the right kinds of changes. Trust is earned cycle by cycle, not assumed.
A hard cycle count. Five cycles means five. And because voice targets place real, billed calls, the Skill confirms that with you once, up front, before anything dials.
A referee it cannot write to. Every score comes from a real evaluation run on Okareo's side. The co-pilot reads Scorecards; it cannot author them. If a tool fails or a run doesn't complete, the Skill reports exactly that and stops. It is forbidden from papering over a gap with an estimated result. And the verdict on each cycle comes from the written analysis of the conversations, not from a single number an optimizer could learn to chase. Goodhart's Law is not a footnote here. It's the design constraint.
The same bar, every cycle. The identical harness re-runs each cycle, and the whole Scorecard is compared, not just the target metric. A fix that improves the objective while regressing something that used to pass is not a clean win, and a regression on a previously passing dimension stops the loop: revert, re-frame, or stop. And if the bar itself turns out to be broken, fixing the bar is a logged cycle of its own, audited against raw transcripts, never a quiet adjustment.
None of this makes the loop slower in practice. It makes the results mean something.
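The "same bar, every cycle" rule can be sketched as a comparison over the whole Scorecard rather than the target metric alone. Scorecards are modeled here as plain name-to-pass/fail dictionaries, which is a simplification. The `compare_scorecards` function and its return strings are hypothetical, and the Skill's real verdict also draws on the written conversation analysis.

```python
from typing import Dict


def compare_scorecards(before: Dict[str, bool], after: Dict[str, bool], objective: str) -> str:
    """Compare two Scorecards from the same harness and return a verdict."""
    # Any Check that passed before and no longer passes is a regression,
    # and a regression outranks any gain on the objective.
    regressions = sorted(
        name for name, passed in before.items() if passed and not after.get(name, False)
    )
    if regressions:
        return "stop: regression on " + ", ".join(regressions)
    if after.get(objective, False) and not before.get(objective, False):
        return "improved"
    return "no change"


before = {"clean_ending": False, "stays_on_topic": True}
clean_win = compare_scorecards(
    before, {"clean_ending": True, "stays_on_topic": True}, "clean_ending"
)
not_a_win = compare_scorecards(
    before, {"clean_ending": True, "stays_on_topic": False}, "clean_ending"
)
```

In the second case the objective improved, but the verdict is still a stop, because something that used to pass no longer does.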
From one loop to a system
The same loop that runs in your editor runs everywhere your agent does. Fast regression checks gate every pull request in CI. Nightly sweeps run the full suite. Pre-release, you run the adversarial Scenarios. And in production, Synthetic Production Monitoring has a Driver call your live agent on a schedule and report back, so you hear about drift before your customers do.
Then the flywheel closes: failed real production calls become tomorrow's Scenarios. Production feeds development.
That's the actual destination. Not a tool that fixes one prompt. A system where agent quality stops depending on heroic engineers reading transcripts at midnight and becomes something your pipeline does, with your judgment at every gate.
What we have today is the first turn of that flywheel, and it's working. Teams are running it now. The rest is in motion.
Or just ask:
/plugin marketplace add okareo-ai/okareo-tools
/plugin install okareo@okareo-tools
/okareo:improve 5 cycles "your hardest behavior problem here"
Get started at okareo.com/mcp. The plugin, the Skills, and the MCP server are at github.com/okareo-ai/okareo-tools.
Simulation finds the failure. Claude proposes the fix. You approve the change.
How well does your agent work?
How do you know? How many scenarios have you actually tried? Ten? Fifty? The same five happy paths your team has been replaying since the pilot?
If you're like most teams shipping agents right now, the honest answer is: you find out in production.
The cost of finding out in production
Agent failures are not stack traces. There is no exception to catch, no 500 to alert on. The failures that matter are behaviors: an agent that gets pulled off-policy when a caller pushes, tone that drifts over a long conversation, a tool called with hallucinated arguments, a refusal that should have happened and quietly didn't.
Observability sees all of this. After the fact. After a real customer lived it.
That gap has a price, and it's not a technical one. Every off-policy answer in production is churn risk you paid to discover. Every quiet failure is an escalation, a compliance question, a screenshot on someone's slide. Enterprises are deploying agents faster than they can verify them, and the verification debt lands on the customer experience.
Simulation in 60 seconds
The alternative is to find these behaviors before customers do. That's what simulation is for.
A Driver is a synthetic user with a personality, context, and a goal. A Target is the agent under test. A Scenario puts them together, and the Driver actually converses with your agent. For voice agents, it places real calls. Checks score what happened, and the results land in a Scorecard.
Thirty Drivers can try thirty things your QA script never thought of. Not like scripts. Like users.
Finding the failure is step one of five
Simulation tells you your agent gets pulled off-topic by 40 percent of pushy callers. Good. Now what?
Now someone reads the failing transcripts. Diagnoses the pattern. Edits the prompt or the flow. Redeploys. Re-runs the simulation. Compares Scorecards. Decides whether it actually got better or just got different.
That loop is an afternoon per cycle. Days per iteration. And it's your most critical team member doing it, because they're the only ones who can hold the whole loop in their heads.
This is the part nobody puts on the architecture diagram: the bottleneck in agent quality isn't detection. It's the remediation loop.
Hand the loop to your co-pilot
So we handed it to the co-pilot.
/okareo:improve ships in the Okareo plugin for Claude Code: one install gets you the Okareo MCP server and the Skills that know how to use it. You type one sentence:
/okareo:improve 5 cycles "How the agent deflects when a non-relevant topic is raised"
A cycle count. A behavior objective in plain language.
Before anything runs, the Skill makes you frame the problem. What does success look like, written down. What counts as a failure. And where this agent gets edited, which the Skill calls the Edit Target: a repo Claude edits directly, an MCP tool for your agent platform, or a diff you apply yourself. The loop refuses to start without a success definition. A loop without one produces unscored runs and no decision.
Then each cycle: Claude calls run_simulation through the Okareo MCP and reads the results as structured data, not as a dashboard. It pulls the failing transcripts, names a single root cause, and proposes exactly one change as a Change Spec: the edit, the rationale, and a before/after snapshot of the config. One change per cycle, or the before/after isn't attributable. Then it pauses and waits for you.
You approve. The change goes out through the Edit Target, the same harness re-runs, same Drivers, same Scenarios, same Checks, and the Scorecards get compared. Every cycle lands in a ledger in your repo, so the trend is a file, not a memory. Repeat until the behavior is resolved or the cycles run out.
The prerequisite is deliberately small: the Okareo MCP, plus any way to change and deploy your agent. The Skill embeds no knowledge of any agent platform. The edit is delegated to wherever your agent actually lives, so if you can change it and ship it programmatically, the loop can run.
This isn't a feature of one agent platform. It's what happens when you give a coding agent a simulation tool and a way to deploy.
What's working today
Teams are running this in their pipelines today. Here's one of our own recent runs, end to end.
The target was a voice agent that answers documentation questions. The behavior: end the call cleanly once the caller is satisfied. You know this failure because you've lived it as a caller. The agent answers your question, you say thanks, and then it offers two more things, says goodbye twice, and reads a canned closing after you've mentally hung up.
The loop didn't start by placing calls. It started from evidence we already had: it took a run from an earlier simulation against the same agent, defined a new Clean Call Ending Check, and re-scored the existing transcripts against it. Re-scores are fast and don't bill a single call, so the baseline cost minutes, not a round of voice traffic. The verdict on that prior run: zero of five conversations ended cleanly. Trailing offers, duplicate goodbyes, post-goodbye boilerplate. Directional rather than same-harness, since the earlier run probed different scenarios, but it told the loop exactly what to build: a wrap-up Scenario with a satisfied-caller Driver and four probes, run fresh from cycle one.
Cycle 1 found a prompt problem. The agent's closing policy literally mandated ending every response with "one concrete next step," and had no goodbye rule at all. The agent wasn't being chatty. It was obeying. The proposed change replaced that section with an ending policy: one short close, nothing after it. Approved, applied, re-run. Three of four closings were now correct in content, but the calls still didn't end.
Cycle 2 found the real problem, and it wasn't in the prompt. Reading the transcripts, the loop's diagnosis was that the agent had no hang-up mechanism at all. End-call was disabled in the platform config, no end-call phrases were registered, and a canned "Goodbye." teardown message was being re-spoken after the agent had already said goodbye. No prompt edit could ever fix that. The proposed change crossed into platform configuration, and when the platform's MCP didn't expose those fields, the loop went through the platform's API directly. After approval: four of four clean endings, verified against raw transcripts. Average call time dropped from about 3.1 to 2.5 minutes, because the agent stopped billing dead air after the conversation was over.
Cycle 3 audited the referee instead of the agent. The raw Check score came back low even though the transcripts showed four clean endings. The loop's verdict comes from reading the conversation analysis, not from chasing the number, and the diagnosis was that the score was a measurement artifact: the final message was duplicated in the history the judges were rendering. The change that cycle was to the Check, not the agent. A loop that only optimized the metric would have "improved" the agent against a broken ruler.
One sharp edge, named plainly: the hang-up now triggers on specific phrases, so any agent utterance containing "goodbye" ends the call. The prompt confines that word to closings, but it's a real constraint of the mechanism, not something we'd hide in a footnote.
The off-topic objective you saw in the command above ran the same way: same one-sentence invocation, different behavior, different fix surface. One pattern, two resolved behaviors.
Evidence, not vibes. And notice what the loop actually spent: a free re-score for the baseline, one simulation per cycle, and minutes of your attention at each approval gate.
The full utility here is not proven yet. We're not claiming it is. What we're claiming is narrower and checkable: the loop closes, on real agents, on real behavior objectives, today, and the fixes it finds aren't confined to the prompt. The promise of where it goes is bigger than that.
The referee neither agent can bribe
If a co-pilot is grading its own homework, none of this means anything. An agent improving an agent needs a referee neither of them can bribe.
So the Skill runs inside four guardrails:
Supervised by default. Every proposed change pauses for your approval before it's applied. Walk away from your desk and nothing ships. Auto mode exists, running all the cycles unattended, but you opt into it explicitly, after you've watched the loop make the right kinds of changes. Trust is earned cycle by cycle, not assumed.
A hard cycle count. Five cycles means five. And because voice targets place real, billed calls, the Skill confirms that with you once, up front, before anything dials.
A referee it cannot write to. Every score comes from a real evaluation run on Okareo's side. The co-pilot reads Scorecards; it cannot author them. If a tool fails or a run doesn't complete, the Skill reports exactly that and stops. It is forbidden from papering over a gap with an estimated result. And the verdict on each cycle comes from the written analysis of the conversations, not from a single number an optimizer could learn to chase. Goodhart's Law is not a footnote here. It's the design constraint.
The same bar, every cycle. The identical harness re-runs each cycle, and the whole Scorecard is compared, not just the target metric. A fix that improves the objective while regressing something that used to pass is not a clean win, and a regression on a previously passing dimension stops the loop: revert, re-frame, or stop. And if the bar itself turns out to be broken, fixing the bar is a logged cycle of its own, audited against raw transcripts, never a quiet adjustment.
None of this makes the loop slower in practice. It makes the results mean something.
From one loop to a system
The same loop that runs in your editor runs everywhere your agent does. Fast regression checks gate every pull request in CI. Nightly sweeps run the full suite. Pre-release, you run the adversarial Scenarios. And in production, Synthetic Production Monitoring has a Driver call your live agent on a schedule and report back, so you hear about drift before your customers do.
Then the flywheel closes: failed real production calls become tomorrow's Scenarios. Production feeds development.
That's the actual destination. Not a tool that fixes one prompt. A system where agent quality stops depending on heroic engineers reading transcripts at midnight and becomes something your pipeline does, with your judgment at every gate.
What we have today is the first turn of that flywheel, and it's working. Teams are running it now. The rest is in motion.
Or just ask:
/plugin marketplace add okareo-ai/okareo-tools /plugin install okareo@okareo-tools /okareo:improve 5 cycles "your hardest behavior problem here"
Get started at okareo.com/mcp. The plugin, the Skills, and the MCP server are at github.com/okareo-ai/okareo-tools.
Simulation finds the failure. Claude proposes the fix. You approve the change.
How well does your agent work?
How do you know? How many scenarios have you actually tried? Ten? Fifty? The same five happy paths your team has been replaying since the pilot?
If you're like most teams shipping agents right now, the honest answer is: you find out in production.
The cost of finding out in production
Agent failures are not stack traces. There is no exception to catch, no 500 to alert on. The failures that matter are behaviors: an agent that gets pulled off-policy when a caller pushes, tone that drifts over a long conversation, a tool called with hallucinated arguments, a refusal that should have happened and quietly didn't.
Observability sees all of this. After the fact. After a real customer lived it.
That gap has a price, and it's not a technical one. Every off-policy answer in production is churn risk you paid to discover. Every quiet failure is an escalation, a compliance question, a screenshot on someone's slide. Enterprises are deploying agents faster than they can verify them, and the verification debt lands on the customer experience.
Simulation in 60 seconds
The alternative is to find these behaviors before customers do. That's what simulation is for.
A Driver is a synthetic user with a personality, context, and a goal. A Target is the agent under test. A Scenario puts them together, and the Driver actually converses with your agent. For voice agents, it places real calls. Checks score what happened, and the results land in a Scorecard.
Thirty Drivers can try thirty things your QA script never thought of. Not like scripts. Like users.
Finding the failure is step one of five
Simulation tells you your agent gets pulled off-topic by 40 percent of pushy callers. Good. Now what?
Now someone reads the failing transcripts. Diagnoses the pattern. Edits the prompt or the flow. Redeploys. Re-runs the simulation. Compares Scorecards. Decides whether it actually got better or just got different.
That loop is an afternoon per cycle. Days per iteration. And it's your most critical team member doing it, because they're the only ones who can hold the whole loop in their heads.
This is the part nobody puts on the architecture diagram: the bottleneck in agent quality isn't detection. It's the remediation loop.
Hand the loop to your co-pilot
So we handed it to the co-pilot.
/okareo:improve ships in the Okareo plugin for Claude Code: one install gets you the Okareo MCP server and the Skills that know how to use it. You type one sentence:
/okareo:improve 5 cycles "How the agent deflects when a non-relevant topic is raised"
A cycle count. A behavior objective in plain language.
Before anything runs, the Skill makes you frame the problem. What does success look like, written down. What counts as a failure. And where this agent gets edited, which the Skill calls the Edit Target: a repo Claude edits directly, an MCP tool for your agent platform, or a diff you apply yourself. The loop refuses to start without a success definition. A loop without one produces unscored runs and no decision.
Then each cycle: Claude calls run_simulation through the Okareo MCP and reads the results as structured data, not as a dashboard. It pulls the failing transcripts, names a single root cause, and proposes exactly one change as a Change Spec: the edit, the rationale, and a before/after snapshot of the config. One change per cycle, or the before/after isn't attributable. Then it pauses and waits for you.
You approve. The change goes out through the Edit Target, the same harness re-runs, same Drivers, same Scenarios, same Checks, and the Scorecards get compared. Every cycle lands in a ledger in your repo, so the trend is a file, not a memory. Repeat until the behavior is resolved or the cycles run out.
The prerequisite is deliberately small: the Okareo MCP, plus any way to change and deploy your agent. The Skill embeds no knowledge of any agent platform. The edit is delegated to wherever your agent actually lives, so if you can change it and ship it programmatically, the loop can run.
This isn't a feature of one agent platform. It's what happens when you give a coding agent a simulation tool and a way to deploy.
What's working today
Teams are running this in their pipelines today. Here's one of our own recent runs, end to end.
The target was a voice agent that answers documentation questions. The behavior: end the call cleanly once the caller is satisfied. You know this failure because you've lived it as a caller. The agent answers your question, you say thanks, and then it offers two more things, says goodbye twice, and reads a canned closing after you've mentally hung up.
The loop didn't start by placing calls. It started from evidence we already had: it took a run from an earlier simulation against the same agent, defined a new Clean Call Ending Check, and re-scored the existing transcripts against it. Re-scores are fast and don't bill a single call, so the baseline cost minutes, not a round of voice traffic. The verdict on that prior run: zero of five conversations ended cleanly. Trailing offers, duplicate goodbyes, post-goodbye boilerplate. Directional rather than same-harness, since the earlier run probed different scenarios, but it told the loop exactly what to build: a wrap-up Scenario with a satisfied-caller Driver and four probes, run fresh from cycle one.
Cycle 1 found a prompt problem. The agent's closing policy literally mandated ending every response with "one concrete next step," and had no goodbye rule at all. The agent wasn't being chatty. It was obeying. The proposed change replaced that section with an ending policy: one short close, nothing after it. Approved, applied, re-run. Three of four closings were now correct in content, but the calls still didn't end.
Cycle 2 found the real problem, and it wasn't in the prompt. Reading the transcripts, the loop's diagnosis was that the agent had no hang-up mechanism at all. End-call was disabled in the platform config, no end-call phrases were registered, and a canned "Goodbye." teardown message was being re-spoken after the agent had already said goodbye. No prompt edit could ever fix that. The proposed change crossed into platform configuration, and when the platform's MCP didn't expose those fields, the loop went through the platform's API directly. After approval: four of four clean endings, verified against raw transcripts. Average call time dropped from about 3.1 to 2.5 minutes, because the agent stopped billing dead air after the conversation was over.
Cycle 3 audited the referee instead of the agent. The raw Check score came back low even though the transcripts showed four clean endings. The loop's verdict comes from reading the conversation analysis, not from chasing the number, and the diagnosis was that the score was a measurement artifact: the final message was duplicated in the history the judges were rendering. The change that cycle was to the Check, not the agent. A loop that only optimized the metric would have "improved" the agent against a broken ruler.
One sharp edge, named plainly: the hang-up now triggers on specific phrases, so any agent utterance containing "goodbye" ends the call. The prompt confines that word to closings, but it's a real constraint of the mechanism, not something we'd hide in a footnote.
The off-topic objective you saw in the command above ran the same way: same one-sentence invocation, different behavior, different fix surface. One pattern, two resolved behaviors.
Evidence, not vibes. And notice what the loop actually spent: a free re-score for the baseline, one simulation per cycle, and minutes of your attention at each approval gate.
The full utility here is not proven yet. We're not claiming it is. What we're claiming is narrower and checkable: the loop closes, on real agents, on real behavior objectives, today, and the fixes it finds aren't confined to the prompt. The promise of where it goes is bigger than that.
The referee neither agent can bribe
If a co-pilot is grading its own homework, none of this means anything. An agent improving an agent needs a referee neither of them can bribe.
So the Skill runs inside four guardrails:
Supervised by default. Every proposed change pauses for your approval before it's applied. Walk away from your desk and nothing ships. Auto mode exists, running all the cycles unattended, but you opt into it explicitly, after you've watched the loop make the right kinds of changes. Trust is earned cycle by cycle, not assumed.
A hard cycle count. Five cycles means five. And because voice targets place real, billed calls, the Skill confirms that with you once, up front, before anything dials.
A referee it cannot write to. Every score comes from a real evaluation run on Okareo's side. The co-pilot reads Scorecards; it cannot author them. If a tool fails or a run doesn't complete, the Skill reports exactly that and stops. It is forbidden from papering over a gap with an estimated result. And the verdict on each cycle comes from the written analysis of the conversations, not from a single number an optimizer could learn to chase. Goodhart's Law is not a footnote here. It's the design constraint.
The same bar, every cycle. The identical harness re-runs each cycle, and the whole Scorecard is compared, not just the target metric. A fix that improves the objective while regressing something that used to pass is not a clean win, and a regression on a previously passing dimension stops the loop: revert, re-frame, or stop. And if the bar itself turns out to be broken, fixing the bar is a logged cycle of its own, audited against raw transcripts, never a quiet adjustment.
None of this makes the loop slower in practice. It makes the results mean something.
From one loop to a system
The same loop that runs in your editor runs everywhere your agent does. Fast regression checks gate every pull request in CI. Nightly sweeps run the full suite. Pre-release, you run the adversarial Scenarios. And in production, Synthetic Production Monitoring has a Driver call your live agent on a schedule and report back, so you hear about drift before your customers do.
Then the flywheel closes: failed real production calls become tomorrow's Scenarios. Production feeds development.
That's the actual destination. Not a tool that fixes one prompt. A system where agent quality stops depending on heroic engineers reading transcripts at midnight and becomes something your pipeline does, with your judgment at every gate.
What we have today is the first turn of that flywheel, and it's working. Teams are running it now. The rest is in motion.
Or just ask:
/plugin marketplace add okareo-ai/okareo-tools
/plugin install okareo@okareo-tools
/okareo:improve 5 cycles "your hardest behavior problem here"
Get started at okareo.com/mcp. The plugin, the Skills, and the MCP server are at github.com/okareo-ai/okareo-tools.