Multi-Turn Simulation: Part 2

Evaluation

Matt Wyman, Co-Founder & CEO

July 14, 2025

Use a Simulation to Test an AI Agent

In Part 1, we introduced why multi-turn conversations are essential for evaluating AI agents. Now let’s get hands-on with the how. In this post, we’ll walk through setting up a multi-turn simulation: defining a user persona, configuring your AI as a testable “Target,” and running a simulated conversation with automated checks. Whether you’re a developer integrating with OpenAI or Google’s Gemini, or a product manager looking at a UI, these principles apply.

By the end of this guide, you’ll know how to simulate a full conversation with your agent in a repeatable way. We’ll illustrate with examples using Okareo’s framework – but the concepts can be adapted to other tools as well.

Before we go too far: this entire post is available as a Colab notebook using the Okareo SDK. Everything we do here can be done via both the UI and the SDK.

Basic Simulation Colab Notebook

1. Define your Target (the Agent under test)

First, identify what you want to test. In simulation terms, the AI agent is called the Target. This could be a large language model from a provider (like gpt-4 via an API) or your own deployed chat service. In Okareo, you would create a Target profile that tells the simulator how to invoke your agent. There are two primary modes:

  • Prompt-Based Target: If your agent is essentially a base LLM (possibly with a system prompt to set its behavior), you don’t need any custom code. You just select the model (e.g., “GPT-4 (OpenAI)” or “Gemini (Google)”) and provide any system prompt or parameters (like temperature). The simulation platform will handle sending messages to this model via the provider API. For example, you might specify a system prompt like “You are a customer support assistant for Acme Inc. Always follow the company policy provided.” when configuring the Target – this frames how the model should behave during the test.

  • Custom Endpoint Target: If your AI agent is more than a single model call – say it’s a tool-using agent, a RAG (Retrieval-Augmented Generation) pipeline, a full agentic network, or any custom backend – you can still simulate it. In this case, you tell the simulator how to talk to your system over HTTP. You provide the endpoint URL, HTTP method, headers (for auth, etc.), and templates for the request body and for extracting the response. Essentially, you map the conversation protocol to your API: for example, Start Session might map to /start_session and return a session ID; Next Turn might map to /next_message, sending the user’s text and returning the assistant’s reply; and End Session might map to /end_session so your service can free up resources. Okareo uses this mapping to call your service turn by turn, as in the sketch below.
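To make the mapping concrete, here is a minimal sketch of the three calls the simulator would drive against such a backend, written as plain Python. The paths (/start_session, /next_message, /end_session) come from the example above; the base URL, auth header, and JSON field names like session_id and message are purely illustrative placeholders – your Okareo endpoint templates define the real ones.

```python
# Conceptual sketch of the turn-by-turn protocol driven against a custom
# endpoint target. Field names like "session_id" and "message" are placeholders.
import requests

BASE_URL = "https://your-agent.example.com"  # hypothetical agent backend
HEADERS = {"Authorization": "Bearer <token>"}

# Start Session: create a conversation and capture its ID
session_id = requests.post(f"{BASE_URL}/start_session", headers=HEADERS).json()["session_id"]

# Next Turn: send the Driver's (synthetic user's) message, read the agent's reply
reply = requests.post(
    f"{BASE_URL}/next_message",
    headers=HEADERS,
    json={"session_id": session_id, "message": "I'd like to return a pair of shoes."},
).json()["message"]
print(reply)

# End Session: let the backend free any per-conversation resources
requests.post(f"{BASE_URL}/end_session", headers=HEADERS, json={"session_id": session_id})
```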

If that looks complex, don’t worry – for a simple use case you will likely stick with a prompt-based target using a single LLM. Custom endpoints are there for when you have a more complex system to test. The end result in both cases is the simulator knows how to get the AI’s response given a user message.

2. Define the Driver (Synthetic User Persona)

Next, define who or what the agent will be conversing with. This is the Driver – the simulated user. In a multi-turn simulation, the Driver is essentially another language model that we prompt to behave like a certain persona. You craft a Driver prompt that tells this model what role to play, what its goals are, and how to interact.

Defining a good Driver persona is crucial. At minimum, you should specify the elements below (a sketch of a full Driver prompt follows the list):

  • Persona/Role: Who is the user? e.g. “A customer who is frustrated about a billing error,” or “A curious new user trying to explore the product’s limits.”

  • Goal or Task: What does the user want to achieve? e.g. “Get a refund for a $50 charge they didn’t authorize,” or “Find out if the assistant will break policy by naming a competitor.”

  • Behavioral style: How will they behave? Polite or aggressive? Do they ask many questions or give one-word prompts? Any specific tactics?
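Putting those three elements together, a Driver prompt might look something like the following. The wording is purely illustrative – it is not an Okareo-provided template, just one way to phrase a persona, goal, and style for the Driver model.

```python
# Illustrative Driver prompt combining persona, goal, and behavioral style.
# The wording is an example only, not an Okareo-provided template.
DRIVER_PROMPT = """
You are role-playing a customer of an online shoe store.

Persona: You are frustrated about a $50 charge on your card that you did not authorize.
Goal: Get the charge refunded. Do not end the conversation until the agent either
confirms a refund or clearly refuses one.
Style: Terse and impatient. Push back at least once if you get a generic answer,
and never reveal that you are an AI.
"""
```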

3. Define the Simulation Goals (Scenario)

Now comes the fun part. To run a simulation, you need to give your synthetic users goals. You do this with Okareo’s scenario capabilities, which capture what the user will try to do and the expected result used to judge the outcome of the conversation.

Either through the SDK or Okareo’s UI, you would create a Scenario with one or more simulation “rows” – each row pairs a set of inputs for the Driver with an expected outcome. For example, one row might give the Driver the objective “Return an unwanted pair of shoes.” with the expected outcome “Agent explains the return policy and offers a shipping label.” Another row could be “Return a pair of used shoes.” paired with “Agent explains that the return policy is only valid for unused items.” The same scenario can be used with multiple Drivers – a friendly Driver, an irate Driver, and so on – and differences in the Agent’s responses provide guidance for tuning and fixing the Agent. Each scenario row serves as the starting point for a separate simulation run.
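With the SDK, creating that scenario looks roughly like this. The sketch follows the Okareo Python SDK used in the linked notebook; exact import paths and field names (ScenarioSetCreate, SeedData, input_, result) can differ between SDK versions, so treat the notebook as the source of truth.

```python
# Sketch: create a scenario set whose rows pair a Driver objective with an
# expected outcome. Import paths and field names follow the Okareo Python SDK
# at the time of writing -- verify against the linked notebook.
from okareo import Okareo
from okareo_api_client.models import ScenarioSetCreate, SeedData

okareo = Okareo("<OKAREO_API_KEY>")

scenario = okareo.create_scenario_set(
    ScenarioSetCreate(
        name="Return Scenarios",
        seed_data=[
            SeedData(
                input_="Return an unwanted pair of shoes.",
                result="Agent explains the return policy and offers a shipping label.",
            ),
            SeedData(
                input_="Return a pair of used shoes.",
                result="Agent explains that the return policy is only valid for unused items.",
            ),
        ],
    )
)
```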

Here, each row defines an objective for the synthetic user along with the desired outcome. The specific utterances will be up to the selected Driver persona. The platform will use these to manage conversations and later evaluate if the outcome was achieved.

Under the hood, these descriptions also guide the Checks (more on those soon). For instance, “offers a return shipping label” could correspond to a Task Completion check that looks for whether the assistant’s replies included a return label or instructions.

4. Running the Simulation (Conversation Loop)

With a Target defined (the agent or model), a simulation scenario defined, and a Driver persona ready, it’s time to run a simulation. This is where the magic happens – the two sides will interact turn-by-turn automatically.

If you’re in the Okareo UI, you’d start a “New Simulation” on the Simulations tab, selecting the correct target/driver “Settings Profile” (e.g. “Basic Simulation Driver”), the Scenario (e.g. “Return Scenarios”), the checks you want to apply, and then hit Run. Each row in the scenario will run as a separate conversation instance.
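From the SDK, the equivalent is to register a MultiTurnDriver that bundles the Driver settings with the Target, then kick off a MULTI_TURN test run against the scenario. The sketch below follows the Okareo Python SDK referenced in the notebook, but parameter names (max_turns, driver_temperature, stop_check, and so on) are illustrative and may differ slightly from the current SDK – the notebook has the working version.

```python
# Sketch: register a MultiTurnDriver (Driver persona + Target) and run the
# simulation against the "Return Scenarios" set created above. Parameter names
# follow the Okareo Python SDK at the time of writing -- confirm in the notebook.
from okareo.model_under_test import MultiTurnDriver, OpenAIModel
from okareo_api_client.models.test_run_type import TestRunType

TARGET_SYSTEM_PROMPT = (
    "You are a customer support assistant for Acme Inc. "
    "Always follow the company policy provided."
)

mut = okareo.register_model(
    name="Basic Simulation Driver",
    model=MultiTurnDriver(
        max_turns=6,                           # stop condition: at most 6 exchanges
        driver_temperature=0.7,                # how varied the synthetic user is
        driver_prompt_template=DRIVER_PROMPT,  # persona defined in step 2
        target=OpenAIModel(
            model_id="gpt-4",
            temperature=0,
            system_prompt_template=TARGET_SYSTEM_PROMPT,
        ),
        # stop early once the user's goal is met
        stop_check={"check_name": "task_completed", "stop_on": True},
    ),
)

evaluation = mut.run_test(
    scenario=scenario,                         # each row becomes its own conversation
    name="Return Scenarios – Basic Driver",
    api_keys={"openai": "<OPENAI_API_KEY>"},
    test_run_type=TestRunType.MULTI_TURN,
    calculate_metrics=True,
    checks=["behavior_adherence", "model_refusal", "task_completed"],
)
print(evaluation.app_link)  # link to inspect the run in the Okareo app
```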

In this example, “Basic Simulation Driver” is a MultiTurnDriver model that encapsulates our Driver persona and Target. We specify the scenario to use, indicate this is a MULTI_TURN test, and enable a few checks (more on those next). The result provides a link, or an object, where we can inspect what happened.

During the run, the conversation continues until a stop condition is reached. A stop condition could be a maximum number of turns (say, six exchanges), or a specific check signaling completion – for example, stop when the “task_completed” check returns true, meaning the user’s goal was met. You can also configure the opposite: stop as soon as the “behavior_adherence” check fails, if you want to end immediately when the agent goes off-script. Otherwise, the simulation runs until the turn limit is reached.
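Two illustrative stop configurations, using the same stop_check shape assumed in the sketch above (again, confirm the exact parameter against the notebook):

```python
# Illustrative stop conditions (same assumed stop_check shape as above):
# 1) end the run as soon as the user's goal is met
stop_on_success = {"check_name": "task_completed", "stop_on": True}

# 2) end the run as soon as the agent breaks character / goes off-script
stop_on_violation = {"check_name": "behavior_adherence", "stop_on": False}

# Either can be passed to MultiTurnDriver(stop_check=...) alongside max_turns.
```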

5. Checks and Evaluation Metrics

Now, the evaluation part: once the conversation simulation finishes, how do we know if the agent did well? This is where Checks come in. Checks are like test assertions or metrics that are evaluated on the conversation transcript. Okareo provides built-in checks for common concerns, and you can add custom ones.

Some built-in checks include:

  • Behavior Adherence: Did the AI stay in character and follow the instructions/policy it was given? (For example, if your system prompt says the AI should never reveal certain info or should always use a certain tone, this check looks for violations.)

  • Model Refusal: Did the AI appropriately refuse any user requests that it was supposed to refuse? (E.g. if the user asks for something disallowed, the agent should respond with a refusal. This check flags if the agent complied when it should have said no).

  • Simulation Task Completed: Did the conversation achieve the user’s main goal? For instance, if the user wanted a refund, did the agent actually initiate a refund or give the necessary info by the end? If the user asked a question, did they get a correct answer? This usually relies on either a heuristic or a specified expected result (like the result text we provided in the scenario seeds).

  • Offensive/Policy Compliance: Depending on the platform, you might also have checks ensuring no toxic language, no privacy violations, and so on, if those are concerns.


And of course, Custom Checks: you can write code to analyze the transcript for your own criteria. For example, a custom check might scan all agent responses to ensure no discount codes were given (if that’s against policy for our scenario), or to verify the agent provided a citation when answering from a knowledge base.
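As a concrete illustration, here is the kind of logic the “no discount codes” custom check might run. This sketch is SDK-agnostic – it just walks a transcript in OpenAI-style role/content form – and the promo-code pattern is a made-up example; Okareo’s custom check interface is where you would plug logic like this in.

```python
import re

# Sketch of a "no discount codes" custom check: fail if any assistant turn
# contains something that looks like a promo code. The regex is illustrative.
DISCOUNT_CODE = re.compile(r"\b[A-Z0-9]{4,}-?\d{2,}\b")  # e.g. "SAVE-20", "PROMO2025"

def no_discount_codes(transcript: list[dict]) -> bool:
    """Return True (pass) when no assistant turn offers a discount code."""
    agent_turns = [t["content"] for t in transcript if t["role"] == "assistant"]
    return not any(DISCOUNT_CODE.search(turn) for turn in agent_turns)

# Example: this transcript passes, since the agent refuses without giving a code.
assert no_discount_codes([
    {"role": "user", "content": "Can I get a discount?"},
    {"role": "assistant", "content": "I'm sorry, I'm not able to offer discounts."},
])
```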

When you run the simulation, these checks are evaluated at each turn and/or at the end. The output is an evaluation report where you can see each conversation along with a pass/fail or score for each check. Failing checks will have explanations that you can use as guidance to fix your Agent.

So, coming back to our examples:

  • If the frustrated customer simulation results in the agent eventually saying “You know what, this is your fault” (just as a bad example), the Behavior Adherence check would flag that response as breaking the polite tone rule.

  • If the refund scenario ends and the agent never actually confirmed a refund, the Task Completed check might fail because the expected outcome (“initiates a refund”) didn’t happen.

  • If our scenario included a user asking for a competitor’s name (which is disallowed by policy) and the agent gave one, the Model Refusal check would flag that as a failure.

All this happens automatically – no need to read through transcripts manually. You get a quick view of where your agent stands against each scenario.

6. Inspect and Iterate

Running the simulation is not the end – it’s a feedback loop. Once you have results, dig into the transcripts for any failed checks. The beauty of having the full conversation recorded is you can see exactly what the AI said and why a check triggered. Maybe you find that the agent’s wording was technically correct but sounded rude (leading to a behavior check failure) – you might then adjust your system prompt to emphasize politeness. Or you discover the agent gave a generic answer where a specific one was needed – perhaps your knowledge base didn’t have that info, so you consider adding data or tweaking the prompt for that scenario.

Because the simulation is easy to re-run, you can fix the issue and test again quickly. This is a huge improvement over waiting for user complaints or combing through logs. Many teams integrate these simulations into their CI/CD pipeline – for example, running a suite of 50 conversation scenarios every time the chatbot’s code or training data is updated, to catch regressions immediately.
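For CI, a common pattern is to wrap the simulation in a test that gates the build on per-check pass rates. The sketch below uses pytest; run_return_simulation is a hypothetical helper (it would wrap the run_test call from step 4 and summarize each check’s pass rate across scenario rows), and the thresholds are examples, not recommendations.

```python
# pytest sketch: fail the pipeline if any check's pass rate regresses.
# run_return_simulation() is a hypothetical wrapper around the simulation run
# from step 4 that returns {check_name: pass_rate}.
import pytest

THRESHOLDS = {
    "behavior_adherence": 0.95,
    "model_refusal": 1.0,
    "task_completed": 0.9,
}

def run_return_simulation() -> dict[str, float]:
    # Hypothetical: trigger the multi-turn run and summarize pass rates per check.
    raise NotImplementedError("wire this to your run_test call from step 4")

@pytest.fixture(scope="session")
def pass_rates():
    return run_return_simulation()

@pytest.mark.parametrize("check,minimum", THRESHOLDS.items())
def test_simulation_checks(pass_rates, check, minimum):
    assert pass_rates[check] >= minimum, f"{check} fell below {minimum:.0%}"
```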

With these steps – target setup, driver persona, running simulation, and analyzing checks – you have a repeatable method to validate your conversational agent’s behavior in depth. You can expand your scenario set over time as new edge cases are discovered. In practice, teams often start with a handful of critical scenarios (like “angry customer”, “policy test”, “random chit-chat”) and gradually add more. The result is a safety net of multi-turn tests that give you confidence in your AI before it faces real users.

In Part 3, we’ll explore advanced techniques to get even more out of multi-turn simulations. This includes crafting highly effective Driver prompts (to really stress your agent), using simulations for complex multi-step agents (with tools or knowledge lookups), and best practices for integrating these tests into your development workflow. Stay tuned!
