How to use LLM Observability and Monitoring to Evaluate and Fine-tune LLMs

Evaluation

Matt Wyman, CEO and Co-founder

Sarah Barber, Senior Technical Content Writer

February 3, 2025

Evaluating LLMs is necessary if you want to ensure they perform well in production environments. It gives you the information you need to make targeted changes to your LLM app, such as improving the system prompt, or fine-tuning when needed. The best way to do either evaluation or fine-tuning is to use real-world user inputs and model responses (recorded using observability and monitoring tools).

By collecting and analyzing telemetry data from your applications and using this data to evaluate specific parts of your app, you can gradually identify areas for improvement and then either make changes to your system prompt or, where that falls short, fine-tune a new model version. These changes can then be deployed as updates without any downtime. This gives you a way to incrementally enhance performance rather than waiting until a model produces 100% perfect results before you can release your product.

In this article, we’ll explore how LLM observability and monitoring data can be used for evaluation and for identifying areas for fine-tuning, and how to use Okareo to easily set up LLM evaluations using this data.

What is LLM observability and monitoring?

LLM observability is the practice of collecting, analyzing, and understanding your LLM application's telemetry data, including unstructured data like traces and logs and structured data in the form of predefined metrics. 

The process of recording structured metrics is called monitoring. LLM monitoring data is typically metadata that includes metrics like latency, number of tokens in the request or response, or the cost per request. This quantitative data can be used to trigger alerts when there's an anomaly or failure, or when anything goes beyond an agreed-upon threshold. For example, if the output response always has a high number of tokens, this suggests you could make your model more concise by changing its system prompt or by fine-tuning. 
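As a rough illustration of what an automated threshold check might look like, here is a minimal Python sketch. The metric names and limits are hypothetical, not any particular monitoring product's API:

# Illustrative only: metric names and threshold values are made up for this example.
THRESHOLDS = {"latency_ms": 2000, "output_tokens": 800, "cost_usd": 0.05}

def check_thresholds(metrics: dict) -> list[str]:
    """Return an alert message for every metric that exceeds its agreed threshold."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name}={value} exceeds threshold {limit}")
    return alerts

# Example: a request that was slow and produced a verbose response
print(check_thresholds({"latency_ms": 3150, "output_tokens": 950, "cost_usd": 0.01}))

In practice, a monitoring tool runs this kind of comparison for you and routes the alerts to your alerting channel; the point is simply that structured metrics make these checks cheap to automate.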

LLM observability, being a broader concept than LLM monitoring, allows you to move beyond simple statements of fact and find out why something happened within your application, providing insights into your application's overall behavior and performance. As part of LLM observability, you can track conversations between a user and your system, or between agents in the back end of your system.

This allows you to debug errors or strange behaviors, such as bias, hallucinations, or the LLM failing to follow a user's request, in real time. You can also use these tracked conversations to run evaluations on your LLM and improve its quality by refining how it responds to a wide variety of queries.

How do LLM observability and monitoring help with LLM evaluation?

LLM observability and monitoring provide a convenient way of collecting telemetry data and organizing it in a useful way so it can be used as data points in your LLM evaluation. A data point is any piece of information used to evaluate your LLM's performance, including user input queries, model outputs, custom system prompts (for context), prior conversation history, or metadata — like response latency, number of tokens in the input or output, error frequency, or financial cost.

You may already be doing LLM evaluation, perhaps with manually created datapoints or synthetic data generation, but now you can use large amounts of telemetry data to automatically create realistic datapoints for your LLM evaluations (once you've stripped out any personally identifiable information, or PII). More data is useful: the more realistic datapoints your evaluation includes, the more accurately it can tell you how well your system is performing.
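How you strip PII depends on your data. As a rough sketch, assuming simple regex-based redaction is acceptable for your use case (it often isn't for names and addresses, where a dedicated PII-detection library or service is safer):

import re

# Illustrative regex-based redaction only; real PII removal usually needs
# a dedicated library or service, especially for names and addresses.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "card": re.compile(r"\b(?:\d[ -]*?){13,16}\b"),
}

def redact_pii(text: str) -> str:
    """Replace likely PII in telemetry text with placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label.upper()}_REDACTED>", text)
    return text

print(redact_pii("Contact me at jane.doe@example.com or +1 555 010 1234."))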

If your LLM's performance worsens once these additional tests are in place, you can try to improve it with prompt tuning (adjusting your custom system prompt so the LLM produces better results for these new use cases) or by fine-tuning the model.

Fine-tuning isn't just for data scientists: software engineers can (and should) do this themselves, especially once they've exhausted their ideas for prompt tuning. Model providers like OpenAI usually provide an API for this that lets you pass in a carefully prepared training dataset (consisting of inputs and expected model outputs). Note that fine-tuning datasets should be separate from the ones you used for evaluation, or you risk overfitting. During fine-tuning, the model adjusts its internal weights until each input produces its corresponding output more reliably.
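As a rough sketch of what this looks like with OpenAI's fine-tuning API, assuming you've already prepared a train.jsonl file of chat-formatted examples (check the current OpenAI documentation for exact model names and parameters):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a JSONL file where each line is {"messages": [...]} containing
# the input conversation and the desired assistant response.
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job on a base model that supports fine-tuning.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)

Once the job completes, the provider gives you a new model identifier that you can swap into your application in place of the base model.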

Why you need LLM evaluation and fine-tuning

In order to ensure production reliability, you need LLM evaluation to give you confidence that your app consistently produces good results. You'll need to simulate a variety of inputs in order to determine whether your LLM will respond appropriately to edge cases, and if it doesn't, you can make changes to your system prompt or use fine-tuning to improve its ability.

You can use LLM evaluation to identify issues like bias, security vulnerabilities, or hallucinations, and use this information to improve your model, which protects your application. Regularly running LLM evaluations also helps prevent regressions caused by changes to prompts, updates to the LLM, or model drift, which can happen over time.

Using LLM observability and monitoring data in Okareo's LLM evaluations

Okareo is a tool for running LLM evaluations, which typically consist of a series of checks and test datapoints. A check could be something like how friendly the LLM output is or whether it's in JSON format. Okareo has a wide variety of pre-baked checks that you can use, but you can also create your own custom checks. A test datapoint is a sample user input, sometimes paired with an example of a "gold standard" model output for that particular input.
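To make the idea of a check concrete, here is a minimal illustration of the kind of logic a JSON-format check encodes. This is plain Python for illustration, not Okareo's check API:

import json

def is_valid_json(model_output: str) -> bool:
    """A toy 'check': pass if the model output parses as JSON."""
    try:
        json.loads(model_output)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_json('{"department": "billing", "summary": "double charge"}'))  # True
print(is_valid_json("Sure! Here's a summary of the complaint..."))             # False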

Now, with Okareo's new "proxy" feature, these test datapoints can come directly from observability and monitoring data from your app. The okareo proxy command inserts a proxy (in this case, LiteLLM) between your app and your LLM, recording all traces and spans as they occur and sending them to Okareo. Within the field of observability, a trace represents the complete workflow of a request as it moves through a distributed system, and a span is just one step within a trace. Each span gets recorded as a separate datapoint in Okareo.

Within the Okareo app, you can then use filters to create a segment of data. Examples of filters include:

  • Token length: Spans where the input text length exceeds 1,000 tokens

  • Keyword presence: Spans where the input query contains a specific keyword or phrase, such as "summarize" or "translate"

  • Latency: Spans where the latency exceeds 1,000 ms

Once you have the segment you're interested in, you can run an LLM evaluation using that segment of data.

How to use LLM observability and monitoring data for LLM evaluation using Okareo

Here, we explain how to pull your observability and monitoring spans into Okareo and run an evaluation on a subset of these.

Set up a proxy to record all datapoints in Okareo

First, create a free Okareo account. Within the Okareo app, create an Okareo API key and ensure that it is set as an environment variable, along with the API key for your LLM provider such as OpenAI or Anthropic.

export OKAREO_API_KEY=<YOUR_OKAREO_API_KEY>
export OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>   # or the equivalent key for your LLM provider

Next, install the Okareo CLI tool. The commands below download the Linux (386) build; if you're on a different platform, pick the matching archive from the Okareo CLI releases page. Once you've run them, check that the CLI installed correctly with the okareo -v command.

curl -O -L https://github.com/okareo-ai/okareo-cli/releases/download/v0.0.20/okareo_0.0.20_linux_386.tar.gz
tar -xvf okareo_0.0.20_linux_386.tar.gz
export PATH="$PATH:$PWD/bin"

Okareo's proxy is built on LiteLLM, so LiteLLM also needs to be installed. Assuming you already have Python and pip installed, you can install it with the pip install litellm command. Depending on whether it gets installed globally or at a user level, you may need to add LiteLLM to your PATH.

Next, start the Okareo proxy with the okareo proxy command. By default, the proxy runs on localhost:4000.

Finally, update your app so that any requests that were previously being sent to your LLM provider now go via the proxy. This is usually as simple as setting your base_url in your application code to point to the proxy's host:port — typically this will be localhost:4000. 
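For example, if your app uses the OpenAI Python SDK, the change looks roughly like this. The base_url path here matches the /v1/chat/completions endpoint used in the cURL examples below; adjust it if your proxy configuration differs:

import os
from openai import OpenAI

# Point the client at the local proxy instead of the provider's API.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="http://localhost:4000/v1",  # the proxy forwards requests to your LLM provider
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Why was I charged twice for my order?"}],
)
print(response.choices[0].message.content)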

Check LLM calls are arriving in the Okareo app as datapoints

Before running your application through the new proxy, test that it's working by sending a cURL request to the endpoint from a new terminal window.

curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "system",
      "content": "You are a customer support assistant trained to summarize user complaints and route them to the appropriate department."
    },
    {
      "role": "user",
      "content": "Why was I charged twice for my order? I need a refund."
    }
  ]
}'

In the Okareo app, you should see a new datapoint appear. Clicking on it allows you to view the conversation between the user and the LLM.

Screen capture of the "Datapoints" tab in the Okareo app.

Let's run two more cURL requests, each with a different customer complaint from the first:

curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "system",
      "content": "You are a customer support assistant trained to summarize user complaints and route them to the appropriate department."
    },
    {
      "role": "user",
      "content": "The app keeps freezing when I try to log in. Please fix it."
    }
  ]
}'
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "system",
      "content": "You are a customer support assistant trained to summarize user complaints and route them to the appropriate department."
    },
    {
      "role": "user",
      "content": "I am so frustrated right now. I’ve been a customer for over 5 years and the app has been getting worse and worse. I was trying to place an order for over an hour yesterday, but the app kept freezing. Then I tried to restart it, but it logged me out and wouldn’t accept my password even though I know it’s correct. After multiple attempts, I had to reset my password, which was a hassle because the reset email didn’t arrive for 30 minutes. When I finally got back in, the item I wanted was out of stock. This is not the first time I’ve had issues with your app either. Last month, there was a payment problem where it double-charged my credit card, and I had to spend hours on the phone with customer service to get it resolved. I’ve been loyal to this company, but I don’t know if it’s worth the hassle anymore. I expect better from a company of your reputation. What are you going to do to fix this mess?"
    }
  ]
}'

The spans from all three of these requests should now exist in your datapoints tab in the Okareo app.

Segmenting your data for specific evaluations

Within your datapoints tab, you can filter your data and save the result as a segment: a filtered subset of your datapoints. In the video below, we create a segment of datapoints where the latency was more than 1,000 ms.

Screen capture of a segment being created in the Okareo app

Once you create a segment, you'll be taken to the Segments tab. Note that as well as the segment you just created, there are some pre-existing Okareo segments such as "Okareo task: summarization" and "Okareo task: role playing." These segments will automatically get populated with datapoints that fit the category. For example, our datapoints have been categorized as "role playing" because the agent is playing the role of a customer support assistant. This could be useful if you wanted to evaluate the tone and style of a specific agent, or if you wanted to see if an agent can maintain this role over a number of conversation turns.

Screen capture of the"role playing" segment in the Okareo app

Running an LLM evaluation

From the Segments tab in the Okareo app, you can run an evaluation on a particular segment by clicking the Run button in the Actions column of that segment's row. You'll be able to name your evaluation run and select the checks you want to run on the evaluation. Okareo will have some checks pre-selected that are likely to be relevant — for the role playing segment, it selects a check called "is in character." You can also add your own checks — here we add one that we created earlier that checks for friendliness:

Screen capture showing how to run an evaluation on a segment in the Okareo app

Using the LLM online evaluation results to improve your app

After a few seconds, the evaluation run is complete, and you can view the results by clicking on the last evaluation run associated with the segment.

Screenshot showing the LLM evaluation results of the Segment in the Okareo app

At the top of the page you can see the average result for each check, but farther down you can see how each datapoint fared. If a datapoint has failed a particular check, you can click on it to see the details of the user input data and the actual model response. This allows you to drill down and understand why it failed and adapt your system accordingly. This could be by changing the system prompt, giving the user clearer instructions, or by fine-tuning your LLM.

Screenshot showing the LLM evaluation results of the Segment in the Okareo app, with details for each datapoint

Take your LLM apps to the next level with Okareo

Evaluating and fine-tuning LLMs doesn't have to be as complicated or time-consuming as some people make it. If you use LLM observability and monitoring tools to collect real-world data, you can use evaluations to pinpoint areas for improvement. This allows you to continuously optimize your models while keeping your applications running smoothly in production.

Sign up for Okareo
