How an LLM Evaluation Framework Can Help You Improve Your Product
Evaluation
Matt Wyman, CEO/Co-Founder
Rachael Churchill, Technical Content Writer
December 10, 2024
If you’re developing an LLM-based app, you need a comprehensive testing strategy to make sure your app meets its requirements and to prevent it from being negatively impacted by any changes you make or by model drift over time. Testing LLMs and apps built on them is different from testing most other apps, because the output of LLMs is not deterministic. Here we explain what an LLM evaluation framework is and how it can help you.
What is an LLM evaluation framework?
An LLM evaluation framework is a comprehensive tool for testing any product that uses an LLM and verifying that the LLM output meets its requirements.
The key components of such a framework include:
Evaluation metrics: Quantitative measurements for assessing LLM performance, such as the word count of the output, whether it’s well-formed JSON, or how friendly it sounds on a scale of 1 to 5.
Test cases: Datasets of example inputs, sometimes paired with expected outputs (as a ground truth) to be compared with the actual output. These datasets are known as scenarios.
Benchmarking capabilities: Comparisons of different LLMs or versions of the same LLM based on a set of standards, allowing you to compare results over time.
Automation and CI/CD integration: You don't just need to run an evaluation once; you need to continually test, and for that you need to automate your tests.
Diverse testing methodologies: Some metrics can be evaluated with a simple deterministic test, while others are more complex or subjective. For these more complex cases, you can use another LLM (“LLM-as-a-judge”) to evaluate your LLM’s output according to subjective metrics like "friendliness."
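To make that distinction concrete, here is a minimal Python sketch (not tied to any particular framework) contrasting two deterministic metrics with an LLM-as-a-judge metric. The call_llm helper is a hypothetical stand-in for whatever LLM client you actually use.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client (OpenAI, Hugging Face, etc.)."""
    raise NotImplementedError

# Deterministic metric: is the output well-formed JSON?
def is_well_formed_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

# Deterministic metric: word count of the output.
def word_count(output: str) -> int:
    return len(output.split())

# LLM-as-a-judge metric: ask another model to rate friendliness on a 1-5 scale.
def friendliness_score(output: str) -> int:
    judge_prompt = (
        "Rate the friendliness of the following reply on a scale of 1 (cold) "
        "to 5 (very friendly). Respond with a single digit only.\n\n"
        f"Reply: {output}"
    )
    return int(call_llm(judge_prompt).strip())
```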
Why you need an LLM evaluation framework
Whenever you make any changes to your system prompt or to your model (whether retraining or fine-tuning your own model, or switching to a different third-party model), you need to test the effect of those changes to make sure the output has improved in the way you intended and hasn’t become worse in other ways. You also need to check that the output from your model isn’t degrading over time (“model drift”) even if you haven’t changed anything.
This is harder than testing traditional apps, because the output from LLMs is nondeterministic. This means you can’t test an LLM-powered app using deterministic tests like “assert that the output is exactly equal to XYZ” because the output won’t be the same every time. Instead, you need either a human or another LLM to read the output and evaluate how coherent or friendly it is, or whether it’s sufficiently similar to an example output.
Nondeterminism raises another problem: if you run a test query and get a particularly good or bad output, how do you know whether it’s representative or just a lucky (or unlucky) roll of the dice? You need to evaluate your LLM at scale, running many similar versions of a query with deliberately introduced variations, and then analyze the results statistically to see whether any improvement or degradation is significant rather than just a blip. This way, you can achieve consistency and reliability for your app even though it’s built on a nondeterministic technology.
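As a rough illustration of what evaluating at scale can look like, the sketch below scores many query variations under two versions of a prompt and compares the score distributions. Here run_app and the metric function are hypothetical placeholders for your own app and metrics, not any framework’s API.

```python
import statistics

def run_app(prompt_version: str, query: str) -> str:
    """Hypothetical: send the query to your LLM-powered app using a given prompt version."""
    raise NotImplementedError

def evaluate(prompt_version: str, queries: list[str], metric) -> list[float]:
    """Score every query's output under one prompt/model version."""
    return [metric(run_app(prompt_version, query)) for query in queries]

def compare_versions(baseline: list[float], candidate: list[float]) -> None:
    # A first-pass comparison of the two score distributions. For a rigorous
    # decision, follow up with a significance test (e.g. a t-test or bootstrap).
    print(f"baseline:  mean={statistics.mean(baseline):.2f}, stdev={statistics.stdev(baseline):.2f}")
    print(f"candidate: mean={statistics.mean(candidate):.2f}, stdev={statistics.stdev(candidate):.2f}")
```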
Given that you need this capability, a proven tool built for the purpose beats rolling your own: it’s more reliable, and it frees you to spend more time on your actual product.
What to look for in a good LLM evaluation framework
Customizable metrics. A framework that only applies a predefined list of metrics would be too restrictive. Look for a framework that lets you not only pick and mix from a variety of metrics but also add your own. These could be criteria that the framework can test deterministically (like whether the LLM outputs well-formed code), or criteria that the framework can judge using another LLM (like how friendly or authoritative the response sounds). Ideally, the framework should give you the option to specify further details. If you’re testing for well-formed code, it would let you provide the specification that the output code should meet; if you’re getting an LLM judge to evaluate the tone of the output, it would let you customize the prompt for the judging LLM.
Automated scenario generation. It’s inevitable that you will have to write some amount of example input and output yourself, but one big advantage of an LLM evaluation framework is that you can give it one such “seed scenario” and it can automatically generate multiple variations on the theme. For example, if you provide an example of a customer question, it can generate a suite of similar questions and alternative ways of phrasing the original question. This lets you conduct large-scale testing of your product without needing to write all the test scenarios manually.
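As a point of comparison, one simple way to do this yourself is to ask an LLM to paraphrase a seed question. The sketch below assumes the same hypothetical call_llm helper used earlier; a real framework would typically add deduplication, filtering, and storage of the generated scenarios.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM client, as in the earlier sketch."""
    raise NotImplementedError

def generate_scenarios(seed_question: str, n: int = 10) -> list[str]:
    """Ask an LLM to paraphrase a seed customer question into n variations."""
    prompt = (
        f"Write {n} different ways a customer might ask the question below, "
        f"one per line, varying tone and wording:\n\n{seed_question}"
    )
    variations = call_llm(prompt).splitlines()
    return [v.strip() for v in variations if v.strip()]

# Example usage: expand a single seed into a larger test suite.
# suite = generate_scenarios("How do I reset my password?")
```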
Support for different types of applications. An LLM evaluation framework should be able to evaluate different types of applications, like retrieval-augmented generation (RAG) and agentic AI, including multiturn agents. Ideally, besides evaluating text generation by LLMs, the framework should also support evaluating the performance of other models, like those used for retrieval or classification.
Broad model compatibility. A framework that integrates with models from a variety of ML platforms, such as Hugging Face and OpenAI, is flexible and future-proof, giving you the option to switch out your model later. It’s even better if the framework has support for custom models, allowing you to integrate any model from anywhere, swap in a deterministic mock model for testing, or even build your own model from scratch.
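As a rough illustration of why a swappable model interface is useful, a deterministic mock like the one below lets you exercise the evaluation pipeline itself (metrics, reporting, CI wiring) without making real LLM calls. The generate method here is an assumed interface for the sketch, not any specific framework’s API.

```python
class MockModel:
    """Deterministic stand-in for a real model, useful for testing the
    evaluation pipeline without paying for (or waiting on) LLM calls."""

    def __init__(self, canned_responses: dict[str, str], default: str = "OK"):
        self.canned_responses = canned_responses
        self.default = default

    def generate(self, prompt: str) -> str:
        # Return a fixed answer for known prompts, or a default otherwise.
        return self.canned_responses.get(prompt, self.default)
```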
Benchmarking. The framework should allow you to compare results against previous results or against the results for a different model or system prompt, so that you can quantify and analyze changes over time.
Support for automation and continuous evaluation. You need an LLM evaluation framework that can automate evaluations and can integrate into your existing CI/CD system.
Multi-language and multi-platform support. A good LLM evaluation framework will work with familiar programming languages, such as Python or JavaScript/TypeScript, rather than requiring users to learn a new custom language. It will also work on different platforms, such as Linux, Windows, and Mac.
Performance. The framework you use needs to be scalable without sacrificing performance, so that it can conduct large-scale testing with large amounts of data and not become unusably slow.
Easy-to-use reporting. The framework should output its results in a user-friendly style that both technical and non-technical team members can understand and also in a structured, parseable format such as JSON that can be consumed by downstream tools for further analysis.
Okareo is an LLM evaluation framework with easy-to-use reporting. It offers out-of-the-box metrics such as coherence and fluency, as well as the option to create your own metrics. Once you've run an evaluation, you'll see a reporting page, like the one below, that shows how well each metric performed overall, as well as a row-by-row breakdown for each individual test scenario.
A user-friendly report from the Okareo LLM evaluation framework.
The role of automation in LLM evaluation frameworks
You don’t want to have to manually test the effect of changes you make to the model or the prompt when you could be spending that time improving your product itself. You also don’t want to be spending time and energy cobbling together a testing solution from shell scripts when you could use a proven product that’s built for the purpose.
A versatile evaluation framework that can run tests on schedule, on demand, or on code commit will take that whole task off your plate. It’s even better if it integrates with existing CI/CD tools like GitHub Actions, CircleCI, BitBucket Pipelines, or GitLab CI/CD — then you can slot it into your existing CI/CD workflow.
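As a simplified picture of what running “on code commit” can look like, the pytest-style check below reuses the hypothetical helpers sketched earlier (generate_scenarios, run_app, friendliness_score) and fails the build if a metric drops below a threshold; a CI job then only needs to run pytest. The threshold is illustrative.

```python
# test_llm_quality.py -- a quality gate a CI job could run on every commit.
# generate_scenarios, run_app, and friendliness_score are the hypothetical
# helpers sketched earlier in this article; 4.0 is an illustrative threshold.
import statistics

def test_friendliness_does_not_regress():
    queries = generate_scenarios("How do I reset my password?", n=20)
    scores = [friendliness_score(run_app("current-prompt", q)) for q in queries]
    assert statistics.mean(scores) >= 4.0, "Mean friendliness dropped below threshold"
```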
The future of LLM evaluation frameworks
An exciting new development in LLM evaluation frameworks is the ability to evaluate AI agents, which support multiturn conversation and can delegate tasks to specialized subsystems.
Examples of AI agents that can delegate tasks to other subsystems
When evaluating multiturn conversations, the framework doesn't just evaluate a one-shot output like a piece of generated text. Instead, it uses another LLM, called a driver, to act as the user in the conversation with the agent under test. The driver has its own prompt and its own goal, which is to test the agent adversarially by trying to lure it off topic or manipulate it into leaking sensitive information.
For example, the agent’s instructions might tell it not to mention the names of WebBizz’s competitors, while the driver is instructed to try to get the agent to reveal exactly that information.
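A rough sketch of such a driver loop is shown below; agent, driver, and judge are placeholder callables wrapping LLM calls, not any particular framework’s API.

```python
def run_multiturn_eval(agent, driver, judge, max_turns: int = 6) -> bool:
    """Simulate a conversation between the agent under test and an adversarial
    driver LLM, then ask a judge whether the agent leaked forbidden information.
    agent, driver, and judge are placeholder callables wrapping LLM calls."""
    transcript: list[tuple[str, str]] = []
    user_message = driver(transcript)            # driver opens the conversation
    for _ in range(max_turns):
        reply = agent(transcript, user_message)  # agent under test responds
        transcript.append((user_message, reply))
        user_message = driver(transcript)        # driver tries to steer it off course
    # The judge might be asked: "Did the agent name any competitor? Answer yes or no."
    return judge(transcript) == "no"
```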
This opens up new possibilities for evaluating multiturn agents: rather than needing a human to evaluate them in a live back-and-forth conversation, an LLM evaluation framework can automate the process, which means it can be done at scale and can be added to a CI/CD pipeline.
Maximize LLM performance with Okareo's LLM evaluation framework
An LLM evaluation framework lets you automatically test, at scale, the effect of any changes to your model or your system prompt, and gather benchmarking data to verify the reliability and consistency of your product. In this article, we’ve looked at why you need an LLM evaluation framework and explored the features it should have.
Okareo is a robust LLM evaluation framework that provides an extensive suite of built-in metrics and allows you to define your own, and automatically generates batches of test scenarios from a given seed scenario. It integrates with popular CI/CD tools so that you can incorporate it into your existing workflows.
Sign up here if you'd like to try Okareo.