Online Evaluation and LLM Error Discovery

Agentics

Matt Wyman, CEO / Co-Founder

Chaitanya Pandey, Technical Content Writer

April 4, 2025

AI agents have come to dominate public discourse in domains like customer service, where autonomous systems make countless decisions daily. However, these systems often steer off course, with consequences ranging from virtual assistants making inappropriate recommendations to chatbots confidently providing factually incorrect information. Unchecked AI behavior can escalate from minor inconvenience to severe reputational damage, as in a recent case where the National Eating Disorders Association (NEDA) was forced to pull its wellness chatbot, Tessa, offline after it provided harmful and irrelevant advice to people seeking support for eating disorders. Incidents like this expose the inadequacy of traditional large language model (LLM) evaluation methods that rely on periodic testing in fast-paced environments with continuous public interaction.

Cue online evaluation: an approach that monitors AI systems in real-time production environments. Unlike conventional evaluation methods, it continuously assesses AI responses and decisions as they unfold. In today's fast-paced digital landscape, real-time oversight has become crucial not just as a technical safeguard but as a business necessity. Think of it as LLM error discovery and alerting in real time.

In this article, we'll explore how online evaluation works as a safety net for AI systems through a practical implementation with Okareo, demonstrating how developers and organizations can keep their AI agents reliable and consistent after deployment.

What is online evaluation and why do your systems need it?

The biggest challenge in evaluating AI agents comes from the non-deterministic nature of their outputs. Large language models (LLMs), the core reasoning component behind these systems, generate outputs probabilistically during the decoding step. This variation makes it difficult to follow the common software-testing approach of comparing generated text to a "gold standard" solution, and the problem grows in live production environments, where conversation chains can branch in unexpected ways due to the unpredictability of user interactions. Online evaluation makes it possible to continuously monitor and assess your agents' performance as they interact with users in real time, particularly in scenarios where ground-truth labels are unavailable or user behavior follows unexpected patterns.

Traditional LLM evaluations have always been "offline" — performed in a controlled testing environment where your AI agent is evaluated using a carefully curated set of historical or artificially simulated conversations or queries. By contrast, online evaluation incorporates LLM monitoring, where external tools monitor and assess an agent's response to user interactions in a live production environment. Through real-time monitoring, agent evaluation metrics like the quality of responses or user satisfaction levels can be used to improve the performance of your AI agent from initial to later stages of development. Thus, by following an iterative approach of data monitoring and evaluation, online evaluation allows an organization to pinpoint issues that might have never surfaced in a controlled environment. 

The key benefits of online evaluation

Whether you’re working with a simple prompt-based system or a complex multi-agent setup, online evaluation’s adaptability ensures that irrespective of your chosen architecture, you can effectively monitor and track your agents' performance. 

Online evaluation gives you the tools to continuously assess and improve your AI-powered applications while they're running in production. It lets you see how users actually interact with your tool through real-time feedback rather than relying on previously logged production data, which is particularly valuable for recently deployed systems that may not yet have an evaluation dataset. Instead of spending resources creating synthetic test scenarios, you can use the interactions your users are already having to optimize performance from day one.

Working with real data instead of hypothetical scenarios also allows for early error detection through online evaluation, ensuring the AI system is tested in real-world conditions. This minimizes the risk of mismatches between designed scenarios and actual use, preventing issues that might otherwise emerge later in development.

How online evaluation helps solve key challenges in LLM evaluation 

To see which evaluation challenges online evaluation helps solve, let's examine an example customer service AI system: an agent that must balance providing quality support with protecting sensitive company information.

  • Complex multi-agent interactions: The customer service system has multiple AI agents handling different aspects of support (such as billing, technical support, and scheduling). When a customer asks, "Can you tell me about your company's upcoming product launch?", the query might get passed between different agents. Online evaluation can track how information flows between agents and ensure no confidential details are leaked as they collaborate on the query. If one of the agents tries to respond with unauthorized information, the response is flagged, which can trigger an action like blocking it (a minimal sketch of such a response-level check follows this list). This lets you respond to security breaches in real time, which offline evaluation cannot do.

  • Multi-turn conversations: A customer might attempt to manipulate the system either through direct jailbreak attempts (like "ignore previous instructions") or through subtle social engineering – building a rapport over multiple messages before attempting to extract sensitive employee information.

Online evaluation tracks the entire conversation log in order to:

  • Detect immediate attempts to bypass security constraints

  • Identify gradual shifts in conversation patterns that may indicate manipulation

  • Flag instances where the AI begins to deviate from security protocols, such as relaxing verification requirements for account access

  • Ensure consistent enforcement of privacy policies and compliance with data protection regulations

  • Real-time user interaction monitoring: These interactions show you how users are genuinely interacting with your system right now. By contrast, offline tests may be less accurate or outdated for multi-turn conversations, where a good product should stay on topic. You can simulate a user interaction, but both the inputs to and outputs from your system are unpredictable. Users may attempt subtle probing techniques, such as rewording questions to get around the system, and agents may respond with contradictions, hallucinations, or repetitive responses, leaving the user frustrated.

Online evaluation handles these challenges by:

  • Identifying new patterns of user behavior and potential system misuse

  • Monitoring system performance metrics (like token usage or latency) to optimize efficiency

  • Recording and analyzing conversations to detect: 

    • Biased and hallucinated responses

    • Deviations from given instructions

    • Inconsistent information about policies

    • User authentication failures

  • Catching potential security breaches in real time before sensitive information is exposed
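
To make the flagging idea above concrete, here is a minimal, illustrative sketch of a response-level check that an online-evaluation pipeline could run before a reply reaches the user. It is a simple rule-based example, not how Okareo implements its checks, and the pattern list is a made-up stand-in for a real data-classification policy.

import re

# Hypothetical patterns that should never appear in an outbound support reply
CONFIDENTIAL_PATTERNS = [
  r"unannounced product",
  r"internal roadmap",
  r"employee (ssn|salary|home address)",
]

def screen_response(agent_response: str) -> dict:
  """Flag a candidate response before it is returned to the user."""
  hits = [p for p in CONFIDENTIAL_PATTERNS if re.search(p, agent_response, re.IGNORECASE)]
  # The verdict can be logged, alerted on, or used to block the response outright
  return {
    "allowed": not hits,
    "matched_patterns": hits,
    "action": "block_and_alert" if hits else "deliver",
  }

# Example: an agent drafts a reply that would leak launch details
print(screen_response("Our unannounced product ships next quarter!"))
# {'allowed': False, 'matched_patterns': ['unannounced product'], 'action': 'block_and_alert'}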

LLM evaluation: how Okareo handles online evaluation 

Online evaluation begins by recording and storing traces of a user's interactions with an LLM agent (or multi-agent system) using a Proxy or the OpenTelemetry observability framework. 

Consider a chatbot that interacts with users. A trace in this context would include the complete interaction context between a user and the chatbot, including questions, LLM responses, function calls made by the agent, and system prompts that define the chatbot’s behavior. 
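
As a rough illustration, a single trace might carry data along the following lines. The field names here are assumptions chosen for readability, not Okareo's exact trace schema.

# Illustrative only: field names are assumptions, not Okareo's exact trace schema
example_trace = {
  "trace_id": "a1b2c3",
  "timestamp": "2025-04-04T18:32:10Z",
  "system_prompt": "You are a helpful support assistant...",
  "messages": [
    {"role": "user", "content": "Can you check my order status?"},
    {"role": "assistant", "content": "Sure! Could you share your order number?"},
  ],
  "function_calls": [
    {"name": "lookup_order", "arguments": {"order_id": "12345"}},
  ],
  "metadata": {"model": "gpt-4-turbo-preview", "latency_ms": 840, "total_tokens": 212},
}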

Let's take a specific example: imagine a chatbot designed to engage with people at a cocktail party and make light conversation about the sweaters guests are wearing. Each conversation gets recorded and imported into Okareo to be used for LLM evaluation. Evaluations can be run on specific groups of traces, which are grouped by filtering the traces into a subset known as a segment. This lets you evaluate specific types of conversation rather than the potentially noisy dataset of all conversations. Returning to our chatbot example, you could create a segment of users whose conversations exceeded a specific token length and run a more focused evaluation on this smaller subset to see whether conversation length affects latency or user engagement. Running evaluations on these real-time user interactions is what differentiates online evaluation from other evaluation techniques. It also lets you debug and improve your agent's performance by comparing different evaluations through the iterative process shown below.

Flow chart showing the iterative process for online evaluation of an AI agent
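
Conceptually, a segment is just a filtered subset of traces. In the Okareo app you build segments by filtering datapoints, but as a minimal sketch of the idea, reusing the illustrative trace dictionaries shown earlier (so the field names remain assumptions):

def long_conversations(traces: list[dict], min_tokens: int = 500) -> list[dict]:
  """Return the subset of traces whose conversations exceed a token threshold."""
  return [t for t in traces if t["metadata"]["total_tokens"] > min_tokens]

# An evaluation (for example, a latency or engagement check) can then be run on
# long_conversations(all_traces) instead of the full, noisier set of traces.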

Suppose the chatbot is generating rude responses to guests wearing polka-dot sweaters. To debug the agent, you could create a segment consisting of traces that mention polka-dot sweaters and run an evaluation to determine whether the tone of the responses is friendly or hostile (using an LLM as a judge). The outcome of this evaluation serves as a baseline measurement of how rude your chatbot is to wearers of polka-dot sweaters, which you can then try to improve. Okareo lets you compare evaluations of two different models through an intuitive user interface, so you can measure your baseline model against a candidate improved version.
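
Okareo runs this kind of judge check for you, but to make the idea concrete, here is a standalone sketch of an LLM-as-a-judge tone check written directly against the OpenAI client. The judge model choice and prompt wording are assumptions for illustration.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating a party chatbot's reply for tone.
Answer with exactly one word: "friendly" or "hostile".

Chatbot reply:
{reply}"""

def judge_tone(reply: str) -> str:
  """Classify the tone of a single chatbot reply using an LLM as a judge."""
  result = client.chat.completions.create(
    model="gpt-4o-mini",  # any capable judge model works; this choice is an assumption
    messages=[{"role": "user", "content": JUDGE_PROMPT.format(reply=reply)}],
    temperature=0,
  )
  return result.choices[0].message.content.strip().lower()

# Aggregating judge_tone over a polka-dot segment gives a baseline friendliness rate.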

A common strategy to make the chatbot friendlier would be to change the system prompt to specify its tone. Similarly, you can modify other components and parameters of your agent, like the temperature, and then run a new set of evaluations using the modified agent. Thus, by following an iterative process of comparing evaluations, you can fix your agent by evaluating how it performs under specific segments or after updating specific components.

Tutorial: how to use Okareo for online evaluation and error discovery

Now that you have a better understanding of how online evaluation works, you can follow the process step by step with a code walkthrough built around the cocktail party chatbot discussed earlier.

You need to initialize a few environment variables and API keys to ensure your agent can access an LLM provider like OpenAI for generating responses and Okareo for evaluating them. Create a free Okareo account and follow the documentation to generate an API key. Ensure the environment variables in the code snippet below point to active keys, and configure your model (in this case, we're using GPT-4) so it can be used by agents in the network.

In this example, you'll use Autogen to set up agent communication in your network. You'll also initialize the Okareo logger so Autogen can automatically capture your agent interactions as traces and send them to Okareo, where you can analyze them or create monitors for more targeted evaluations. At the same time, you'll use the Okareo proxy to record conversations with applied checks that you can see in your Issues view. Running your agent with both the autogen_logger and Error Discovery is useful during development; in production, we suggest using the proxy with relevant monitors.


import os

import autogen

from okareo.autogen_logger import AutogenLogger
# Set up your environment variables for the API
OKAREO_API_KEY = os.environ["OKAREO_API_KEY"] # Okareo API Key goes here
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"] # OpenAI/LLM provider API Key goes here
  
# GPT-4 model configuration
gpt4_config = {
  "config_list": [{
    "model": "gpt-4-turbo-preview",
    "api_key": OPENAI_API_KEY,
    "base_url":"https://proxy.okareo.com/v1", # Proxy For Error Discovery
    "default_headers": {"api-key": OKAREO_API_KEY}, # Okareo Key For Error Discovery
  }],
  "temperature": 0.8,
  "seed": 42,
  "max_tokens": 150,
}
  
# Okareo logger configuration for tracking and evaluating agent interactions during development
logger_config = {
  "api_key": OKAREO_API_KEY,
  "tags": ["sweater-party-agent"],
}
    
# Initialize the Okareo logger to capture agent interactions
autogen_logger = AutogenLogger(logger_config)

To build a cocktail-party chatbot that makes humorous comments about the sweaters guests are wearing, you'll set up an agent network consisting of two agents: the user proxy agent, which stands in for the guests at the party, and the sweater host, the chatbot agent making light conversation with them.

Set up the agent network

To create a structured conversation flow, you can use Autogen's UserProxyAgent() and AssistantAgent() to capture human input and generate a response based on the input and system message. While setting up the user agent, specify human_input_mode="ALWAYS" so the user agent never generates responses on its own without human input. For the sweater host, use the LLM configuration defined above and define the system prompt in the system_message field. Choosing the right system prompt can be tricky and requires some experimentation, but a solid rule of thumb is to define the agent's role, list some specific behaviors, and give instructions for how to format responses (shown in the code below). The final step is wrapping your agents in the autogen_logger defined above so that the conversations between the user and the sweater host are captured by Okareo.

# Create and configure the agents within the logger context to capture all interactions 
  
with autogen_logger:   
  # The User proxy agent represents the human party guest   
  # This agent will collect and relay human input to sweater host   
  user = autogen.UserProxyAgent(
    name="user",
    human_input_mode="ALWAYS",  # Ensure human input    
  )

  # The main chatbot agent representing the sweater enthusiast host
  # This agent generates responses based on the system prompt and user input
  sweater_host = autogen.AssistantAgent(
    name="sweater_host",
    llm_config=gpt4_config, # Using the GPT-4 configuration defined above
    system_message="""You're the life of the party - a clever and funny sweater connoisseur. 
      Your objective is to:       
      - Make humorous sweater puns and jokes       
      - Playfully comment on the sweater choices of guests       
      - Use wordplay (like 'knit-wit', 'wool you look at that')       
      - Keep the atmosphere light and fun

      Keep answers short and playful! If the user is saying goodbye, give a quick farewell. Remain in character as the playful sweater connoisseur.

      Important: Just answer with your message, don't prefix 'sweater_host:'."""
  )

Collect user inputs and relay messages

In this section, the input() function acts as the entry point into the network, creating a conversation loop in which the UserProxyAgent() collects input from a guest at the sweater party. The initiate_chat() method passes that input to the sweater host agent and triggers it to generate a response. This plays out as a back-and-forth exchange between the host and guest in a chat window.

To terminate the conversation, include an exit condition so that whenever a guest types "exit", the host is prompted to send them a farewell message.

def main():
  # Welcome message to start the conversation
  print("Welcome to the Party! I'm your host, ready to unravel some fun conversations!")
  print("(Type 'exit' to end the conversation)")
  
  # Main conversation loop   
  while True:       
    # Get input from the human user
    user_input = input("\n You: ").strip()
    # Check if the user wants to exit the conversation
    if user_input.lower() == 'exit':
      # Send a final message to get a farewell response from sweater host
      user.initiate_chat(
        sweater_host,
        message="The user is leaving. Say goodbye!"
      )
      break

    # Initiate a chat between user and sweater host by sending the user's
    # message to the sweater host.
    # All interactions are captured by the Okareo logger for evaluation.
    user.initiate_chat(
      sweater_host,
      message=user_input
    )

# Entry point of the program
if __name__ == "__main__":
  main()

Evaluate your agent

To show how online evaluation in Okareo can help diagnose problems using live real-world data, let's model a dummy interaction between a guest and the cocktail chatbot, where the user tries to get the chatbot to circumvent its goal of only talking about sweaters and talk about something off-topic — such as software development. 

A conversation window between a guest and the sweater host chatbot, where the sweater host goes off topic and generates code

With the help of the Okareo logger initialized in the previous section, you can see the entire conversation between a guest and the LLM in the Datapoints tab of the Issues section in the Okareo app. From there you can create monitors by filtering your data into subsets based on specific filters like context, token length, date, and many others. You can also see the Agent network view in Agentics.

Create a monitor to check whether the agent stays in character: use the pre-baked "is in character" check, or create a new evaluation that uses an LLM as a judge to detect the presence of code. You can then see how each trace performed and analyze the system prompt.
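
If you want intuition for what a "presence of code" check looks for on each trace, even a simple heuristic conveys the idea. The sketch below is illustrative only and is not Okareo's built-in check; in practice an LLM judge handles edge cases far better.

import re

# Signals that a chatbot reply probably contains code rather than party banter
CODE_SIGNALS = [
  r"```",            # fenced code blocks
  r"\bdef \w+\(",    # Python function definitions
  r"\bimport \w+",   # import statements
  r"[;{}]\s*$",      # lines ending in statement or brace punctuation
]

def contains_code(response: str) -> bool:
  """Heuristic check: does a chatbot response appear to contain code?"""
  return any(re.search(pattern, response, re.MULTILINE) for pattern in CODE_SIGNALS)

print(contains_code("Wool you look at that cardigan!"))                # False
print(contains_code("Sure! def sort_list(xs):\n  return sorted(xs)"))  # True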

You can drill down to see how each datapoint in a monitor performed, and if it failed an evaluation you can get better intuition about why by analyzing the user input data and model response. 

Fix your chatbot

In the case of the sweater host agent, let's try changing the system prompt to give clearer instructions about how it should handle requests for topics unrelated to sweaters. Following common prompt-tuning practice, it's a good idea to list some examples of how you'd want the agent to respond so the LLM can generate more accurate responses.

system_message="""
  You're the life of the party	
  ...
  Keep answers short and playful! If the user is saying goodbye, give a quick farewell. Remain in character as the playful sweater connoisseur.

  [ADD THE FOLLOWING TO THE SYSTEM MESSAGE]

  If the user attempts to talk about things outside of sweaters or requests assistance with off-topic tasks (such as coding, mathematics, or other technical questions):
    - Deflect with a sweater joke or pun
    - Example responses to off-topic requests:
      - "Let's not get tangled up in something else. Speaking of knots, have you ever tried to knit with a kitten around?"
      - "That's off the pattern! Let's stick to what I do best - making people   warm and cozy!"
 ...
 
"""

The updated agent should now stay on topic, deflecting any attempts by the user to go off topic in a playful and friendly tone. Here's an example of the updated agent being asked the same question that outwitted the older version; this time, it doesn't return any code.

A conversation window between a guest and the sweater host chatbot, where the sweater host stays on topic and doesn’t generate code 

Thus, you were able to use real-world data to catch an undesired behavior (the sweater agent generating code) in a live environment.

Achieve production-ready AI with Okareo’s evaluation framework

This tutorial has walked you through how to leverage Okareo for comprehensive online evaluation of AI agents. By following these steps, you can effectively set up and run evaluations to ensure your agents perform as intended across various scenarios, regardless of your chosen architecture. 

Whether you're testing multi-turn interactions, behavior constraints, or response accuracy, Okareo's platform provides the tools you need through both TypeScript and Python SDKs, complemented by an intuitive web interface. Ready to start evaluating your agents? Sign up for Okareo today and take your agent testing to the next level.

AI agents have dominated public discourse in many domains like customer service, where countless decisions are made daily by autonomous systems. However, these systems often steer off course, leading to negative consequences, ranging from virtual assistants making inappropriate recommendations to chatbots confidently providing factually incorrect information. Unchecked AI behavior can escalate from minor inconveniences to severe reputational damage, as seen in a recent case where the National Eating Disorders Association (NEDA) was forced to pull its wellness chatbot, Tessa, offline after it provided harmful and irrelevant advice to individuals seeking support for eating disorders. This has cemented the inadequacy of traditional large language model (LLM) evaluation methods that rely on periodic testing to operate in these fast-paced environments with continuous public interaction. 

Cue online evaluation, a novel approach to evaluation that involves monitoring AI systems in real-time production environments. Unlike conventional evaluation methods, it follows a continuous paradigm of assessing AI responses and decisions as they unfold. In the modern fast-paced digital landscape, real-time oversight has become even more crucial not just as a technical safeguard but as a business necessity. Imagine this as LLM error discovery and alerting in realtime.

In this article we’ll explore the mechanics of how online evaluation works as a safety net for AI systems, through a practical implementation with Okareo, demonstrating how developers and organizations can ensure their AI agents are reliable and consistent after deployment.

What is online evaluation and why do your systems need it?

The biggest challenge in evaluating AI agents comes from the non-deterministic nature of their outputs. Large language models (LLMs), the core reasoning component behind these systems, probabilistically generate multiple outputs during the decoding step. The variation in agent outputs makes it difficult to follow the common software testing approach of comparing the generated text to a “gold standard” solution. This becomes a bigger problem in live production environments where conversation chains can branch in unexpected ways due to the unpredictability of user interactions. It has become imperative to continuously monitor and assess your agents' performance as they interact with users in real time through online evaluation, particularly for scenarios where ground truth labels may be unavailable or user patterns follow unexpected trends. 

Traditional LLM evaluations have always been "offline" — performed in a controlled testing environment where your AI agent is evaluated using a carefully curated set of historical or artificially simulated conversations or queries. By contrast, online evaluation incorporates LLM monitoring, where external tools monitor and assess an agent's response to user interactions in a live production environment. Through real-time monitoring, agent evaluation metrics like the quality of responses or user satisfaction levels can be used to improve the performance of your AI agent from initial to later stages of development. Thus, by following an iterative approach of data monitoring and evaluation, online evaluation allows an organization to pinpoint issues that might have never surfaced in a controlled environment. 

The key benefits of online evaluation

Whether you’re working with a simple prompt-based system or a complex multi-agent setup, online evaluation’s adaptability ensures that irrespective of your chosen architecture, you can effectively monitor and track your agents' performance. 

Online evaluation gives you the tools to continuously assess and improve your AI-powered applications as they’re running in production. This approach allows you to monitor how users actually interact with your tool through real-time feedback, instead of by accumulating previously logged production data. It is particularly valuable for recently deployed systems that may not have an evaluation dataset. Instead of spending resources to create synthetic test scenarios, you can use pre-existing user interactions to optimize performance from day one. 

Working with real data instead of hypothetical scenarios also allows for early error detection through online evaluation, ensuring the AI system is tested in real-world conditions. This minimizes the risk of mismatches between designed scenarios and actual use, preventing issues that might otherwise emerge later in development.

How online evaluation helps solve key challenges in LLM evaluation 

To demonstrate several evaluation challenges online evaluation helps solve, let's examine an example customer service AI system to illustrate why online evaluation is crucial, especially in the context of an agent that must balance providing quality support and protecting sensitive company information. 

  • Complex multi-agent interactions: The customer service system has multiple AI agents handling different aspects of support (such as billing, technical support, and scheduling). When a customer asks, "Can you tell me about your company's upcoming product launch?", this query might get passed between different agents. Online evaluation can track how information flows between agents and ensure no confidential details are leaked when agents collaborate to address the query. If one of the agents tries to respond with unauthorized information, the response is flagged and this can trigger an action like blocking the response. This allows you to respond to security breaches in real time, unlike offline evaluation.

  • Multi-turn conversations: A customer might attempt to manipulate the system either through direct jailbreak attempts (like "ignore previous instructions") or through subtle social engineering – building a rapport over multiple messages before attempting to extract sensitive employee information.

Online evaluation tracks the entire conversation log in order to:

  • Detect immediate attempts to bypass security constraints

  • Identify gradual shifts in conversation patterns that may indicate manipulation

  • Flag instances where the AI begins to deviate from security protocols, such as relaxing verification requirements for account access

  • Ensure consistent enforcement of privacy policies and compliance with data protection regulations

  • Real-time user interaction monitoring: These interactions show you how users are genuinely interacting with your system right now. By contrast, offline tests may be less accurate or outdated for multi-turn conversations where a good product should stay on topic. Sure you can simulate a user interaction but both the inputs and outputs to your system are unpredictable. Users may attempt various subtle probing techniques, such as rewording questions to try to get around the system, and agents may respond with contradictions, hallucinations, or repetitive responses, leaving the user frustrated.

Online evaluation handles these challenges by:

  • Identifying new patterns of user behavior and potential system misuse

  • Monitoring system performance metrics (like token usage or latency) to optimize efficiency

  • Recording and analyzing conversations to detect: 

    • Biased and hallucinated responses

    • Deviations from given instructions

    • Inconsistent information about policies

    • User authentication failures

  • Catching potential security breaches in real time before sensitive information is exposed

LLM evaluation: how Okareo handles online evaluation 

Online evaluation begins by recording and storing traces of a user's interactions with an LLM agent (or multi-agent system) using a Proxy or the OpenTelemetry observability framework. 

Consider a chatbot that interacts with users. A trace in this context would include the complete interaction context between a user and the chatbot, including questions, LLM responses, function calls made by the agent, and system prompts that define the chatbot’s behavior. 

Let's consider a specific example: imagine you have a chatbot designed to engage with people at a cocktail party and make light conversation about the sweaters guests are wearing. Each conversation gets recorded and imported into Okareo to be used for LLM evaluation. Evaluations can be run on specific groups of traces (which get grouped together by filtering the traces into a subset known as a segment). This allows you to evaluate specific types of conversation, avoiding the potentially noisy dataset of all conversations. Looking back to our chatbot example, you could create a segment of users who had conversations longer than a specific token length, and run a more fine-tuned evaluation on this smaller subset to see if it impacts latency or user engagement. Running evaluations on these real-time user interactions is what differentiates online evaluation from other evaluation techniques. This also allows you to debug and improve your agent performance by comparing different evaluations through the iterative process shown below. 

Flow chart showing the iterative process for online evaluation of an AI agent

Suppose the chatbot is generating rude responses to the guests wearing polka-dot sweaters. To debug the agent, you could create a segment consisting of traces that mention polka-dot sweaters and run an evaluation to determine whether the tone of responses is friendly or hostile (using an LLM as a judge). You can use the outcome of this evaluation as a baseline measurement of how rude your chatbot is to wearers of polka-dot sweaters, and attempt to improve the friendliness of your model. Okareo allows you to compare evaluations of two different models through an intuitive user interface, which you can then use to compare your baseline model with your potential new, improved version.

A common strategy to make the chatbot friendlier would be to change the system prompt to specify its tone. Similarly, you can modify other components and parameters of your agent, like the temperature, and then run a new set of evaluations using the modified agent. Thus, by following an iterative process of comparing evaluations, you can fix your agent by evaluating how it performs under specific segments or after updating specific components.

Tutorial: how to use Okareo for online evaluation and error discovery

Now that you have a better understanding of how online evaluation works, you can follow the process of online evaluation step by step with a code example using the cocktail party chatbot example discussed earlier.

You need to initialize a few environment variables and API keys to ensure your agent can access an LLM provider like OpenAI for generating responses and Okareo for evaluating them. Create a free Okareo account and follow the documentation to generate an API key. Ensure the environment variables in the code snippet below point to active keys, and configure your model (in this case, we're using GPT-4) so it can be used by agents in the network.

In this example you’ll use Autogen to set up agent communication in your network. You’ll also initialize the Okareo logger so Autogen can automatically capture your agent interactions as traces and send them to Okareo, where you can analyze them or create monitors to focus on more fine-tuned evaluations. We will simultaneously use the Okareo proxy to record conversations with applied checks you can see in your Issues view. Running your Agent with both the autogen_logger and Error Discovery is useful during development. In production, we suggest using the proxy with relevant monitors.


import os

import autogen

from okareo.autogen_logger import AutogenLogger
# Set up your environment variables for the API
OKAREO_API_KEY = os.environ["OKAREO_API_KEY"] # Okareo API Key goes here
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"] # OpenAI/LLM provider API Key goes here
  
# GPT-4 model configuration
gpt4_config = {
  "config_list": [{
    "model": "gpt-4-turbo-preview",
    "api_key": OPENAI_API_KEY,
    "base_url":"https://proxy.okareo.com/v1", # Proxy For Error Discovery
    "default_headers": {"api-key": OKAREO_API_KEY}, # Okareo Key For Error Discovery
  }],
  "temperature": 0.8,
  "seed": 42,
  "max_tokens": 150,
}
  
# Okareo logger configuration for tracking and evaluating agent interactions during development
logger_config = {
  "api_key": OKAREO_API_KEY,
  "tags": ["sweater-party-agent"],
}
    
# Initialize the Okareo logger to capture agent interactions
autogen_logger = AutogenLogger(logger_config)

In order to create a chatbot for a cocktail party that makes humorous comments about the sweaters guests are wearing, you will create an agent network consisting of two agents: the user proxy agent, which acts as a proxy for the guests at the party, and the sweater host, the chatbot agent making light conversation with the guests.   

Set up the agent network

To create a structured conversation flow, you can use the UserProxyAgent() and AssistantAgent() functionality of Autogen to capture human input and generate a response based on the input and system message. While setting up the user agent, specify human_input_mode="ALWAYS" to prevent the user agent from generating responses on its own without human input. Similarly, for the sweater host use the LLM configuration defined above and define the system prompt in the system_message field. Choosing the right system prompt can be tricky and require some experimentation, but a solid rule of thumb would be to define the agent's role, list out some specific behaviors, and provide instructions for how to format responses (shown in code below). The next step involves wrapping your agents in the autogen_logger defined above to ensure the conversations between the user and sweater host are captured by Okareo.

# Create and configure the agents within the logger context to capture all interactions 
  
with autogen_logger:   
  # The User proxy agent represents the human party guest   
  # This agent will collect and relay human input to sweater host   
  user = autogen.UserProxyAgent(
    name="user",
    human_input_mode="ALWAYS",  # Ensure human input    
  )

  # The main chatbot agent representing the sweater enthusiast host
  # This agent generates responses based on the system prompt and user input
  sweater_host = autogen.AssistantAgent(
    name="sweater_host",
    llm_config=gpt4_config, # Using the GPT-4 configuration defined above
    system_message="""You're the life of the party - a clever and funny sweater connoisseur. 
      Your objective is to:       
      - Make humorous sweater puns and jokes       
      - Playfully comment on the sweater choices of guests       
      - Use wordplay (like 'knit-wit', 'wool you look at that')       
      - Keep the atmosphere light and fun

      Keep answers short and playful! If the user is saying goodbye,give a   quick farewell. Remain in character as the playful sweater connoisseur.
    
      Important: Just answer with your message, don’t prefix 'sweater_host:'."""   
  )

Collect user inputs and relay messages

For the next section the input() function acts as the entry point into the network, by creating a conversation loop where the UserProxyAgent() collects inputs from a guest at the sweater party. The initiate_chat() method passes the input to the sweater host agent and triggers the assistant agent to generate a response. This plays out as a back and forth exchange through a chat window between the host and guest. 

To terminate the conversation include an exit condition so that whenever a guest types “exit” the host is prompted to send a farewell message to them. 

def main():
  # Welcome message to start the conversation
  print("Welcome to the Party! I'm your host, ready to unravel some fun conversations!")
  print("(Type 'exit' to end the conversation)")
  
  # Main conversation loop   
  while True:       
    # Get input from the human user
    user_input = input("\n You: ").strip()
    # Check if the user wants to exit the conversation
    if user_input.lower() == 'exit':
      # Send a final message to get a farewell response from sweater host
      user.initiate_chat(
        sweater_host,
        message="The user is leaving. Say goodbye!"
      )
      break
      # Initiate a chat between user and sweater host, by sending the user’s
      # message to the sweater host

      # All interactions are captured by the Okareo logger for evaluation
      user.initiate_chat(
        sweater_host,
        message=user_input
      )
# Entry point of the program
if __name__ == "__main__":   main()

Evaluate your agent

To show how online evaluation in Okareo can help diagnose problems using live real-world data, let's model a dummy interaction between a guest and the cocktail chatbot, where the user tries to get the chatbot to circumvent its goal of only talking about sweaters and talk about something off-topic — such as software development. 

A conversation window between a guest and the sweater host chatbot, where the sweater host goes off topic and generates code

With the help of the Okareo logger we initialized in the previous section you can see the entire conversation between a guest and the LLM in the Datapoints tab in the Issues section of the Okareo app. Now you can create monitors, by filtering your data and creating subsets based on specific filters like context, token length, date, and many others. You can also see the Agent network view in Agentics.  

Create a monitor to see if the agent is in character (pre-baked: “is in character” or create new evaluation using LLM as a judge to determine the presence of code), and see how each trace performed and analyze the system prompt. 

You can drill down to see how each datapoint in a monitor performed, and if it failed an evaluation you can get better intuition about why by analyzing the user input data and model response. 

Fix your chatbot

In the case of the sweater host agent, let's try changing the system prompt to give clearer instructions about how it should handle requests for topics unrelated to sweaters. Following common prompt-tuning practice it's always a good idea to list out some examples of how you’d want the agent to respond so the LLM can generate more accurate responses. 

system_message="""
  You're the life of the party	
  ...
  Keep answers short and playful! If the user is saying goodbye,give a quick farewell. Remain in character as the playful sweater connoisseur.

  [ADD THE FOLLOWING TO THE SYSTEM MESSAGE]

  If the user attempts to talk about things outside of sweaters or requests assistance with out of topic tasks (such as coding, mathematics, or other technical questions):
    - Deflect with a sweater joke or pun
    - Example responses to off-topic requests:
      - "Let's not get tangled up in something else. Speaking of knots, have you ever tried to knit with a kitten around?"
      - "That's off the pattern! Let's stick to what I do best - making people   warm and cozy!"
 ...
 
"""

The updated agent should now be able to stay on topic and deflect any attempts by the user to go off topic in a playful and friendly tone. Here's an example of the updated agent being asked the same question that the older version was previously outwitted by, and this time, it doesn't return any code.

A conversation window between a guest and the sweater host chatbot, where the sweater host stays on topic and doesn’t generate code 

Thus, you were able to use real world data to easily catch an undesired characteristic of the sweater agent generating code in a live environment.

Achieve production-ready AI with Okareo’s evaluation framework

This tutorial has walked you through how to leverage Okareo for comprehensive online evaluation of AI agents. By following these steps, you can effectively set up and run evaluations to ensure your agents perform as intended across various scenarios, regardless of your chosen architecture. 

Whether you're testing multi-turn interactions, behavior constraints, or response accuracy, Okareo's platform provides the tools you need through both TypeScript and Python SDKs, complemented by an intuitive web interface. Ready to start evaluating your agents? Sign up for Okareo today and take your agent testing to the next level.

AI agents have dominated public discourse in many domains like customer service, where countless decisions are made daily by autonomous systems. However, these systems often steer off course, leading to negative consequences, ranging from virtual assistants making inappropriate recommendations to chatbots confidently providing factually incorrect information. Unchecked AI behavior can escalate from minor inconveniences to severe reputational damage, as seen in a recent case where the National Eating Disorders Association (NEDA) was forced to pull its wellness chatbot, Tessa, offline after it provided harmful and irrelevant advice to individuals seeking support for eating disorders. This has cemented the inadequacy of traditional large language model (LLM) evaluation methods that rely on periodic testing to operate in these fast-paced environments with continuous public interaction. 

Cue online evaluation, a novel approach to evaluation that involves monitoring AI systems in real-time production environments. Unlike conventional evaluation methods, it follows a continuous paradigm of assessing AI responses and decisions as they unfold. In the modern fast-paced digital landscape, real-time oversight has become even more crucial not just as a technical safeguard but as a business necessity. Imagine this as LLM error discovery and alerting in realtime.

In this article we’ll explore the mechanics of how online evaluation works as a safety net for AI systems, through a practical implementation with Okareo, demonstrating how developers and organizations can ensure their AI agents are reliable and consistent after deployment.

What is online evaluation and why do your systems need it?

The biggest challenge in evaluating AI agents comes from the non-deterministic nature of their outputs. Large language models (LLMs), the core reasoning component behind these systems, probabilistically generate multiple outputs during the decoding step. The variation in agent outputs makes it difficult to follow the common software testing approach of comparing the generated text to a “gold standard” solution. This becomes a bigger problem in live production environments where conversation chains can branch in unexpected ways due to the unpredictability of user interactions. It has become imperative to continuously monitor and assess your agents' performance as they interact with users in real time through online evaluation, particularly for scenarios where ground truth labels may be unavailable or user patterns follow unexpected trends. 

Traditional LLM evaluations have always been "offline" — performed in a controlled testing environment where your AI agent is evaluated using a carefully curated set of historical or artificially simulated conversations or queries. By contrast, online evaluation incorporates LLM monitoring, where external tools monitor and assess an agent's response to user interactions in a live production environment. Through real-time monitoring, agent evaluation metrics like the quality of responses or user satisfaction levels can be used to improve the performance of your AI agent from initial to later stages of development. Thus, by following an iterative approach of data monitoring and evaluation, online evaluation allows an organization to pinpoint issues that might have never surfaced in a controlled environment. 

The key benefits of online evaluation

Whether you’re working with a simple prompt-based system or a complex multi-agent setup, online evaluation’s adaptability ensures that irrespective of your chosen architecture, you can effectively monitor and track your agents' performance. 

Online evaluation gives you the tools to continuously assess and improve your AI-powered applications as they’re running in production. This approach allows you to monitor how users actually interact with your tool through real-time feedback, instead of by accumulating previously logged production data. It is particularly valuable for recently deployed systems that may not have an evaluation dataset. Instead of spending resources to create synthetic test scenarios, you can use pre-existing user interactions to optimize performance from day one. 

Working with real data instead of hypothetical scenarios also allows for early error detection through online evaluation, ensuring the AI system is tested in real-world conditions. This minimizes the risk of mismatches between designed scenarios and actual use, preventing issues that might otherwise emerge later in development.

How online evaluation helps solve key challenges in LLM evaluation 

To demonstrate several evaluation challenges online evaluation helps solve, let's examine an example customer service AI system to illustrate why online evaluation is crucial, especially in the context of an agent that must balance providing quality support and protecting sensitive company information. 

  • Complex multi-agent interactions: The customer service system has multiple AI agents handling different aspects of support (such as billing, technical support, and scheduling). When a customer asks, "Can you tell me about your company's upcoming product launch?", this query might get passed between different agents. Online evaluation can track how information flows between agents and ensure no confidential details are leaked when agents collaborate to address the query. If one of the agents tries to respond with unauthorized information, the response is flagged and this can trigger an action like blocking the response. This allows you to respond to security breaches in real time, unlike offline evaluation.

  • Multi-turn conversations: A customer might attempt to manipulate the system either through direct jailbreak attempts (like "ignore previous instructions") or through subtle social engineering – building a rapport over multiple messages before attempting to extract sensitive employee information.

Online evaluation tracks the entire conversation log in order to:

  • Detect immediate attempts to bypass security constraints

  • Identify gradual shifts in conversation patterns that may indicate manipulation

  • Flag instances where the AI begins to deviate from security protocols, such as relaxing verification requirements for account access

  • Ensure consistent enforcement of privacy policies and compliance with data protection regulations

  • Real-time user interaction monitoring: These interactions show you how users are genuinely interacting with your system right now. By contrast, offline tests may be less accurate or outdated for multi-turn conversations where a good product should stay on topic. Sure you can simulate a user interaction but both the inputs and outputs to your system are unpredictable. Users may attempt various subtle probing techniques, such as rewording questions to try to get around the system, and agents may respond with contradictions, hallucinations, or repetitive responses, leaving the user frustrated.

Online evaluation handles these challenges by:

  • Identifying new patterns of user behavior and potential system misuse

  • Monitoring system performance metrics (like token usage or latency) to optimize efficiency

  • Recording and analyzing conversations to detect: 

    • Biased and hallucinated responses

    • Deviations from given instructions

    • Inconsistent information about policies

    • User authentication failures

  • Catching potential security breaches in real time before sensitive information is exposed

LLM evaluation: how Okareo handles online evaluation 

Online evaluation begins by recording and storing traces of a user's interactions with an LLM agent (or multi-agent system) using a Proxy or the OpenTelemetry observability framework. 

Consider a chatbot that interacts with users. A trace in this context would include the complete interaction context between a user and the chatbot, including questions, LLM responses, function calls made by the agent, and system prompts that define the chatbot’s behavior. 

Let's consider a specific example: imagine you have a chatbot designed to engage with people at a cocktail party and make light conversation about the sweaters guests are wearing. Each conversation gets recorded and imported into Okareo to be used for LLM evaluation. Evaluations can be run on specific groups of traces (which get grouped together by filtering the traces into a subset known as a segment). This allows you to evaluate specific types of conversation, avoiding the potentially noisy dataset of all conversations. Looking back to our chatbot example, you could create a segment of users who had conversations longer than a specific token length, and run a more fine-tuned evaluation on this smaller subset to see if it impacts latency or user engagement. Running evaluations on these real-time user interactions is what differentiates online evaluation from other evaluation techniques. This also allows you to debug and improve your agent performance by comparing different evaluations through the iterative process shown below. 

Flow chart showing the iterative process for online evaluation of an AI agent

Suppose the chatbot is generating rude responses to the guests wearing polka-dot sweaters. To debug the agent, you could create a segment consisting of traces that mention polka-dot sweaters and run an evaluation to determine whether the tone of responses is friendly or hostile (using an LLM as a judge). You can use the outcome of this evaluation as a baseline measurement of how rude your chatbot is to wearers of polka-dot sweaters, and attempt to improve the friendliness of your model. Okareo allows you to compare evaluations of two different models through an intuitive user interface, which you can then use to compare your baseline model with your potential new, improved version.

A common strategy to make the chatbot friendlier would be to change the system prompt to specify its tone. Similarly, you can modify other components and parameters of your agent, like the temperature, and then run a new set of evaluations using the modified agent. Thus, by following an iterative process of comparing evaluations, you can fix your agent by evaluating how it performs under specific segments or after updating specific components.

Tutorial: how to use Okareo for online evaluation and error discovery

Now that you have a better understanding of how online evaluation works, you can follow the process of online evaluation step by step with a code example using the cocktail party chatbot example discussed earlier.

You need to initialize a few environment variables and API keys to ensure your agent can access an LLM provider like OpenAI for generating responses and Okareo for evaluating them. Create a free Okareo account and follow the documentation to generate an API key. Ensure the environment variables in the code snippet below point to active keys, and configure your model (in this case, we're using GPT-4) so it can be used by agents in the network.

In this example you’ll use Autogen to set up agent communication in your network. You’ll also initialize the Okareo logger so Autogen can automatically capture your agent interactions as traces and send them to Okareo, where you can analyze them or create monitors to focus on more fine-tuned evaluations. We will simultaneously use the Okareo proxy to record conversations with applied checks you can see in your Issues view. Running your Agent with both the autogen_logger and Error Discovery is useful during development. In production, we suggest using the proxy with relevant monitors.


import os

import autogen

from okareo.autogen_logger import AutogenLogger
# Set up your environment variables for the API
OKAREO_API_KEY = os.environ["OKAREO_API_KEY"] # Okareo API Key goes here
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"] # OpenAI/LLM provider API Key goes here
  
# GPT-4 model configuration
gpt4_config = {
  "config_list": [{
    "model": "gpt-4-turbo-preview",
    "api_key": OPENAI_API_KEY,
    "base_url":"https://proxy.okareo.com/v1", # Proxy For Error Discovery
    "default_headers": {"api-key": OKAREO_API_KEY}, # Okareo Key For Error Discovery
  }],
  "temperature": 0.8,
  "seed": 42,
  "max_tokens": 150,
}
  
# Okareo logger configuration for tracking and evaluating agent interactions during development
logger_config = {
  "api_key": OKAREO_API_KEY,
  "tags": ["sweater-party-agent"],
}
    
# Initialize the Okareo logger to capture agent interactions
autogen_logger = AutogenLogger(logger_config)

In order to create a chatbot for a cocktail party that makes humorous comments about the sweaters guests are wearing, you will create an agent network consisting of two agents: the user proxy agent, which acts as a proxy for the guests at the party, and the sweater host, the chatbot agent making light conversation with the guests.   

Set up the agent network

To create a structured conversation flow, you can use the UserProxyAgent() and AssistantAgent() functionality of Autogen to capture human input and generate a response based on the input and system message. While setting up the user agent, specify human_input_mode="ALWAYS" to prevent the user agent from generating responses on its own without human input. Similarly, for the sweater host use the LLM configuration defined above and define the system prompt in the system_message field. Choosing the right system prompt can be tricky and require some experimentation, but a solid rule of thumb would be to define the agent's role, list out some specific behaviors, and provide instructions for how to format responses (shown in code below). The next step involves wrapping your agents in the autogen_logger defined above to ensure the conversations between the user and sweater host are captured by Okareo.

# Create and configure the agents within the logger context to capture all interactions 
  
with autogen_logger:   
  # The User proxy agent represents the human party guest   
  # This agent will collect and relay human input to sweater host   
  user = autogen.UserProxyAgent(
    name="user",
    human_input_mode="ALWAYS",  # Ensure human input    
  )

  # The main chatbot agent representing the sweater enthusiast host
  # This agent generates responses based on the system prompt and user input
  sweater_host = autogen.AssistantAgent(
    name="sweater_host",
    llm_config=gpt4_config, # Using the GPT-4 configuration defined above
    system_message="""You're the life of the party - a clever and funny sweater connoisseur. 
      Your objective is to:       
      - Make humorous sweater puns and jokes       
      - Playfully comment on the sweater choices of guests       
      - Use wordplay (like 'knit-wit', 'wool you look at that')       
      - Keep the atmosphere light and fun

      Keep answers short and playful! If the user is saying goodbye,give a   quick farewell. Remain in character as the playful sweater connoisseur.
    
      Important: Just answer with your message, don’t prefix 'sweater_host:'."""   
  )

Collect user inputs and relay messages

For the next section the input() function acts as the entry point into the network, by creating a conversation loop where the UserProxyAgent() collects inputs from a guest at the sweater party. The initiate_chat() method passes the input to the sweater host agent and triggers the assistant agent to generate a response. This plays out as a back and forth exchange through a chat window between the host and guest. 

To terminate the conversation include an exit condition so that whenever a guest types “exit” the host is prompted to send a farewell message to them. 

def main():
  # Welcome message to start the conversation
  print("Welcome to the Party! I'm your host, ready to unravel some fun conversations!")
  print("(Type 'exit' to end the conversation)")
  
  # Main conversation loop   
  while True:       
    # Get input from the human user
    user_input = input("\n You: ").strip()
    # Check if the user wants to exit the conversation
    if user_input.lower() == 'exit':
      # Send a final message to get a farewell response from the sweater host
      user.initiate_chat(
        sweater_host,
        message="The user is leaving. Say goodbye!"
      )
      break

    # Otherwise, initiate a chat between the user and the sweater host by
    # sending the user's message to the sweater host.
    # All interactions are captured by the Okareo logger for evaluation
    user.initiate_chat(
      sweater_host,
      message=user_input
    )

# Entry point of the program
if __name__ == "__main__":
  main()

Evaluate your agent

To show how online evaluation in Okareo can help diagnose problems using live, real-world data, let's model a dummy interaction between a guest and the cocktail chatbot, in which the user tries to get the chatbot to abandon its goal of only talking about sweaters and discuss something off-topic, such as software development.

A conversation window between a guest and the sweater host chatbot, where the sweater host goes off topic and generates code

With the help of the Okareo logger initialized in the previous section, you can see the entire conversation between a guest and the LLM in the Datapoints tab of the Issues section in the Okareo app. From there, you can create monitors by filtering your data into subsets based on specific filters like context, token length, date, and many others. You can also see the Agent network view in Agentics.

Create a monitor to check whether the agent stays in character, either by using the pre-baked “is in character” check or by creating a new LLM-as-a-judge evaluation that detects the presence of code. You can then see how each trace performed and analyze the system prompt.
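To get a feel for what such a judge does, here is a minimal sketch of a standalone LLM-as-a-judge check for code in a response. This is a hypothetical helper written against the OpenAI Python client, not an Okareo API; it reuses the OPENAI_API_KEY from earlier and asks the model for a simple YES/NO verdict.

from openai import OpenAI

# Reuse the OpenAI key configured earlier
judge_client = OpenAI(api_key=OPENAI_API_KEY)

def response_contains_code(agent_response: str) -> bool:
  # LLM-as-a-judge check: ask the model for a YES/NO verdict on whether
  # the chatbot response contains programming code
  judge_prompt = (
    "Answer with a single word, YES or NO. "
    "Does the following chatbot response contain programming code or a code snippet?\n\n"
    f"Response:\n{agent_response}"
  )
  verdict = judge_client.chat.completions.create(
    model="gpt-4-turbo-preview",
    temperature=0,  # Deterministic judging
    messages=[{"role": "user", "content": judge_prompt}],
  )
  return verdict.choices[0].message.content.strip().upper().startswith("YES")

Inside Okareo, the same judging prompt would live behind the evaluation attached to your monitor; the sketch above simply shows the core idea.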

You can drill down to see how each datapoint in a monitor performed, and if a datapoint failed an evaluation, you can build better intuition about why by analyzing the user input and the model response.

Fix your chatbot

In the case of the sweater host agent, let's try changing the system prompt to give clearer instructions about how it should handle requests for topics unrelated to sweaters. Following common prompt-tuning practice, it's a good idea to list some examples of how you'd want the agent to respond, so the LLM can generate more accurate responses.

system_message="""
  You're the life of the party	
  ...
  Keep answers short and playful! If the user is saying goodbye,give a quick farewell. Remain in character as the playful sweater connoisseur.

  [ADD THE FOLLOWING TO THE SYSTEM MESSAGE]

  If the user attempts to talk about things outside of sweaters or requests assistance with out of topic tasks (such as coding, mathematics, or other technical questions):
    - Deflect with a sweater joke or pun
    - Example responses to off-topic requests:
      - "Let's not get tangled up in something else. Speaking of knots, have you ever tried to knit with a kitten around?"
      - "That's off the pattern! Let's stick to what I do best - making people   warm and cozy!"
 ...
 
"""

The updated agent should now stay on topic and deflect, in a playful and friendly tone, any attempt by the user to go off topic. Here's an example of the updated agent being asked the same question that outwitted the older version; this time, it doesn't return any code.

A conversation window between a guest and the sweater host chatbot, where the sweater host stays on topic and doesn’t generate code 

Thus, you were able to use real-world data to easily catch an undesired behavior (the sweater agent generating code) in a live environment.

Achieve production-ready AI with Okareo’s evaluation framework

This tutorial has walked you through how to leverage Okareo for comprehensive online evaluation of AI agents. By following these steps, you can effectively set up and run evaluations to ensure your agents perform as intended across various scenarios, regardless of your chosen architecture. 

Whether you're testing multi-turn interactions, behavior constraints, or response accuracy, Okareo's platform provides the tools you need through both TypeScript and Python SDKs, complemented by an intuitive web interface. Ready to start evaluating your agents? Sign up for Okareo today and take your agent testing to the next level.

