Testing Self-Hosted, Re-Trained LLMs Using Okareo: A Practical Guide

Evaluation

Matt Wyman, Co-founder Okareo

Sarah Barber, Senior Technical Content Writer

September 10, 2024

If you work with LLMs, you need to evaluate their output whenever they change, so you can be sure they will continue to produce acceptable results.

Learn how to do automated LLM testing using Okareo. We explain how to use Okareo to evaluate self-hosted re-trained LLMs, and cover core concepts.

Working with LLMs could involve re-training an LLM that you host yourself, fine-tuning an LLM, or using a third-party LLM such as GPT-4. In all cases, LLM testing is vital. Okareo is an automated tool for LLM testing, and it can be used for all three of these use cases.

Self-hosting an LLM means you have more control over it, and you can change the model itself, which requires you to then test that the output of the changed LLM is as good as (or better than!) the output of the original LLM.

In this article, we focus specifically on how to use Okareo to evaluate self-hosted, re-trained LLMs — which is useful for ML engineers and data scientists. However, if you're an AI app developer looking to find out how to test your app when it's non-deterministic and the results keep changing, you may also want to have a look at our guide on testing AI applications.

What is LLM testing?

LLM testing (also known as LLM evaluation) means evaluating the output of an LLM by comparing its result with an expected result, a benchmark, or with a previous result you're trying to improve upon.

LLM testing can be manual or automated. If it's automated, it can also be included in your CI workflow.

When testing your LLM, you might want to use standard metrics like consistency, conciseness, relevance, and BLEU score, or you might need to create your own custom checks. With Okareo, you can use both together. Okareo also offers AI-assisted checks, where one LLM is used to judge the output of another – for example, deciding whether a response has a friendly tone.

Why is LLM testing important?

LLM testing ensures that your LLM results are accurate and reliable, by comparing them against the results from a previously-used version of the LLM, or established benchmarks. It allows you to check that the new LLM response performs better than (or as well as) the previous one. If you don't have control over the LLM, you can compare the results against a set of expected results instead.

Testing your LLM helps to catch hallucinations, which can happen because LLMs' responses are based purely on statistical patterns rather than any "true" understanding of the subject matter. An example of a hallucination would be an LLM being asked "Who was the first astronaut to go to Jupiter?" and responding "Neil Armstrong." No one has ever traveled to Jupiter, so there is no correct answer to this question that includes a person's name; the LLM simply uses the patterns it has learned to produce a plausible-sounding one.

Another important reason to test your LLM is to help detect (and mitigate against) bias. If your training dataset doesn't contain enough examples, you might end up with responses with unintended bias. For example, if your user asks "What jobs can a person with no college degree get?" and your LLM responds "Without a college degree, people usually end up in low-paying jobs like retail or manual labor," this is an example of socioeconomic bias — and is also not very helpful for your user!

By implementing LLM testing, you can identify areas for improvement and then take steps to implement them. With self-hosted models, this could include making changes to your training data, making algorithmic adjustments to your model, or applying post-processing filters to help reduce bias. If you're using a third-party LLM, you have less power to make changes, but you can rephrase or add more context to the prompts that are sent to the model. You can then re-run the tests to validate that the change has actually improved the LLM’s performance.

Introducing Okareo: an automated LLM testing solution

Okareo is a tool for testing LLMs, or evaluating their output. It can be used to evaluate all types of LLMs — from those that you're hosting yourself for retraining purposes to third-party LLMs that application developers integrate into their apps.

The key elements of Okareo are checks, scenarios and the evaluation process.

A check in Okareo is one specific metric or way to evaluate the LLM's output. These can be predefined standard checks (like consistency or BLEU score) or they can be custom checks tailored to your project's use case (like checking your LLM has a friendly tone, or that the response is code in a specific format).

Okareo also allows you to create a set of scenarios, which are pairs of inputs and expected outputs that are used to test how well the LLM performs. The closer the actual outputs of the LLM are to the expected outputs for each input, the better it has performed.

The evaluation process in Okareo tests the output of the LLM, using the set of scenarios and checks that you defined. You can view the results of this evaluation in the Okareo app, or add them to a CI workflow.

How to test an LLM using Okareo

Okareo tests can be written as Python or TypeScript code. We will use Python in this example. The full code for this tutorial is available on our GitHub.

The main steps are to register your LLM with Okareo, create a scenario set, create some checks, and then run the evaluation.

1. Set up Okareo

Register for an Okareo account and make a note of your API key. Next, install the Okareo CLI so you can run an Okareo flow script on your machine, and initialize your project, which creates a directory structure for your Okareo flow scripts.
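
Your flow script also needs an Okareo client object, which the later code snippets refer to as okareo. Here is a minimal sketch of that setup, assuming you export your API key in an OKAREO_API_KEY environment variable (the variable name here is just a convention):

import os

from okareo import Okareo

# Create the Okareo client used later by register_model() and upload_scenario_set()
okareo = Okareo(os.environ["OKAREO_API_KEY"])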



2. Register your LLM with Okareo

It's important to register your LLM with Okareo, as this creates a formal connection between the two, which enables Okareo to evaluate your LLM. Even if you're evaluating a third-party LLM like an OpenAI one, you still need to register it with Okareo.

Okareo isn't just for evaluating LLMs. It can also be used to evaluate other types of AI models, such as classification or retrieval models for RAG. Any type of model that you want to evaluate needs to be registered with Okareo.

Our example in this article is of a self-hosted model used for retraining, which is more complex than registering a third-party LLM. You'll need to first define a custom model in your Python Okareo flow script, and then register it.

Define a custom model

A model in Okareo is a class with an invoke() method, which Okareo calls when it runs your model. A custom model is such a class that you create yourself. When you use Okareo to evaluate your LLM, it calls your custom model's invoke() method with each scenario's input and collects the result it returns. It then compares the returned results against the expected results.

Below is an example of a custom model that you could define for your LLM: a pre-trained BART summarization model from HuggingFace. The invoke() method returns a tuple of your model result, along with some context, which is the structure that Okareo expects to perform the evaluation.

from transformers import pipeline

from okareo.model_under_test import CustomModel

class CustomGenerationModel(CustomModel):
  # Constructor: pass the model's name through to Okareo's CustomModel base class
  def __init__(self, name: str):
    super().__init__(name=name)
    self.summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

  # Define the invoke method to be called on each input of a scenario
  def invoke(self, input: str) -> tuple:
    # call your model being tested using <input> from the scenario set
    result = self.summarizer(input, max_length=130, min_length=30, do_sample=False)[0]["summary_text"]
    # return a tuple of (model result, overall model response context)
    return result, {"model_response": "Generation successful"}
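
If you'd like to sanity-check the custom model locally before wiring it into an evaluation, you can instantiate it and call invoke() directly. The input text below is purely illustrative:

model = CustomGenerationModel(name="bart-summarizer")
summary, context = model.invoke(
  "Okareo is a tool for evaluating LLMs and other AI models. It lets you define "
  "scenarios and checks, run evaluations against them, and review the results in "
  "the Okareo app or in your CI workflow, so that regressions are caught early."
)
print(summary)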

Register the custom model

Once you've defined your custom model, you'll need to register it with Okareo, which you can do with the following code snippet:

model_under_test = okareo.register_model(
    name="My LLM",
    model=CustomGenerationModel(name="bart-summarizer")
)

In this example, we assign the return value of register_model() to a variable (model_under_test), which we'll use later to run the evaluation.

3. Create a scenario set

A scenario is an LLM input paired with its expected output, and a scenario set is a group of these scenarios. For example, an LLM that deals with customer e-commerce queries might have a scenario set like this (one JSON object per line, matching the JSONL upload format used below):

{"input": "What is your returns policy?", "result": "Our return policy allows for returns within 30 days with proof of purchase."}
{"input": "How long does shipping take?", "result": "Shipping typically takes 3-5 business days."}

Scenario sets are used by Okareo to evaluate your LLM. Once scenarios have been saved in Okareo, they can be reused to seed the generation of other scenarios, but your very first "seed" scenarios must be uploaded from a file or otherwise created.

scenario_set = okareo.upload_scenario_set(
  file_path='./path/to/your/file.jsonl',
  scenario_set_name="your_scenario_set_name"
)

Ensure you get hold of the ID of your scenario set, which is needed to run your LLM evaluation.

scenario_id = scenario_set.scenario_id
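
Once your seed scenarios are uploaded, you can also have Okareo generate additional scenarios from them (for example, rephrased variants) to broaden coverage. The sketch below is illustrative only: the helper names (generate_scenario_set, ScenarioSetGenerate, ScenarioType) are assumptions about the Okareo SDK's generated client, so check the scenario documentation for the exact API in your version.

from okareo_api_client.models import ScenarioSetGenerate, ScenarioType

# Generate rephrased variants of the uploaded seed scenarios (names are assumptions)
generated_scenarios = okareo.generate_scenario_set(
  ScenarioSetGenerate(
    source_scenario_id=scenario_id,
    name="rephrased e-commerce questions",
    number_examples=2,
    generation_type=ScenarioType.REPHRASE_INVARIANT,
  )
)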

4. Define which checks you want to apply to your evaluation

In Okareo, checks are code snippets that help you evaluate your LLM against specific metrics. Okareo ships with a set of pre-built native checks, and all you need to do to use them is list them by name:

checks = ['coherence', 'fluency']

Alternatively, you can create and register your own custom checks. Okareo makes this easy: you can describe a check in natural language and Okareo will generate the code for it, or you can write the check code yourself.
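
To make this concrete, here's a minimal sketch of what a code-based custom check might look like. The class, the evaluate() signature, and the registration call shown here (CodeBasedCheck, create_or_update_check) are assumptions about the Okareo SDK, so treat this as illustrative and follow the checks documentation for the exact interface:

from okareo.checks import CodeBasedCheck

class SummaryLengthCheck(CodeBasedCheck):
  # Pass if the generated summary is at most half the length of the scenario input
  @staticmethod
  def evaluate(model_output: str, scenario_input: str, scenario_result: str) -> bool:
    return len(model_output) <= 0.5 * len(scenario_input)

# Register the check with Okareo so it can be referenced by name (call name is an assumption)
okareo.create_or_update_check(
  name="summary_length",
  description="Generated summary should be at most half the length of the input",
  check=SummaryLengthCheck(),
)

checks = ['coherence', 'fluency', 'summary_length']  # built-in and custom checks can be mixed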

5. Evaluate your model

Now it's time to test your LLM. Call Okareo's run_test() function on your newly registered model. Pass the scenario set ID to this function (so that each scenario input can be passed to your custom invoke() method). Set your test_run_type to NL_GENERATION, which is the correct type for LLMs, and pass your checks to the evaluation run. Finally, you can print a link to the evaluation results in Okareo, for convenience.

from okareo_api_client.models.test_run_type import TestRunType

evaluation = model_under_test.run_test(
  name="My LLM Run",
  scenario=scenario_id,
  test_run_type=TestRunType.NL_GENERATION,
  calculate_metrics=True,
  checks=checks,
)
print(f"See results in Okareo: {evaluation.app_link}")

To run your evaluation, run the okareo run command (https://okareo.com/docs/sdk/cli#run) in your project directory. Once it has finished, you can view the results in the Okareo app.

Screenshot of Okareo app LLM testing results page.

Wrapping up your self-hosted LLM testing with Okareo

Testing LLMs is essential to ensure the accuracy, relevance, and reliability of their output. In this article, we've shown you how to test self-hosted LLMs for re-training purposes, but Okareo can be used to test many kinds of LLMs, as well as classification models and RAG pipelines.

Using Okareo for your LLM testing gives you a high degree of customization: you can easily define custom checks, evaluate model performance against expected results, and integrate these tests into your CI workflows.

If you'd like to try out an LLM evaluation with Okareo, you can sign up and get started for free.
