Testing AI Applications
Evaluation
Matt Wyman, Co-founder of Okareo
Oleksii Klochai, Technical Content Writer
July 26, 2024
Testing AI applications is tricky because they’re non-deterministic. Here’s how to use Okareo to test the output of your AI-powered apps.
If you’re building an application that takes advantage of AI and LLMs, it's just as important to thoroughly test your code as it is with any other type of application. However, there are a number of issues that make it trickier to test AI apps than non-AI apps.
When writing traditional software, it’s straightforward to write some code for a piece of functionality and then write a deterministic test to ensure that this functionality works. For an AI application, however, a “test” isn’t such a simple concept. First, LLMs are not deterministic, so if you're writing a test around an LLM, you can’t always expect the same output. This is a problem as standard software tests expect a deterministic output.
Second, it is unlikely that you are using just one model, and the unpredictability when combining multiple models can skyrocket. Third, components such as RAG or agent architecture might influence the outputs significantly and aren’t easy to test. And fourth, reliance on third-party APIs may cause additional uncertainty around end user experience — for example, when the provider updates their model or changes its training data.
While it is not straightforward to test AI applications, there are ways. In this article, we show how you can do it using Okareo, a platform for testing LLM applications. We demonstrate with a simple Next.js AI app (powered by GPT-4o) that helps users identify the processing method of a particular coffee based on the tasting notes written on its packaging.
The key difference between testing non-AI and AI applications
The overall testing paradigm is well established in the software world. Generally, tests are organized as a pyramid: many unit tests (function level), some integration tests, and a few end-to-end tests.
Testing non-AI software: With regular software, you can supply one input and expect to receive one correct output. You can even simulate an entire end-to-end scenario with negligible differences from one execution to another. Because most code produces deterministic output, you can write tests that supply one input to a function or make an API call, and then compare the response to the expected value character by character.
Testing AI software: AI apps are harder to test because their output is non-deterministic. A model might accept many valid inputs and produce many correct outputs, along with many incorrect outputs that are very difficult to separate from the correct ones. When testing an LLM component, for example, a unit test that compares a single output to the expected text character by character is no longer reliable, because each call to the LLM can return slightly different results, even for the exact same prompt.
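To make the contrast concrete, here is a minimal sketch of a label-level check that tolerates differences in wording. The `KNOWN_METHODS` list and the `extractMethods` helper are hypothetical names chosen for illustration; they are not part of the example app or of any SDK.

```typescript
// Hypothetical list of labels the app is allowed to answer with.
const KNOWN_METHODS = ["washed", "natural", "honey", "anaerobic"];

// Pull every known processing method out of a free-form LLM response.
// Simple substring matching is an assumption that is good enough for a sketch.
function extractMethods(response: string): string[] {
  const text = response.toLowerCase();
  return KNOWN_METHODS.filter((method) => text.includes(method));
}

// Two differently worded responses that a human would both accept.
const runA: string = "Most likely processing method: Washed or Anaerobic";
const runB: string = "most likely processing method - washed.";

// A character-by-character comparison treats them as different...
console.log(runA === runB); // false

// ...while a label-level check accepts both.
console.log(extractMethods(runA).includes("washed")); // true
console.log(extractMethods(runB).includes("washed")); // true
```

A check like this still judges only one response at a time, so it complements the aggregate, data-driven testing described later in the article rather than replacing it.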
Key approaches to testing AI applications
Manual testing
The very first steps in testing AI are usually ad hoc, manual prompt tests: a person writes a prompt, looks at the output, and decides whether the result looks good. Manual testing is an important step when working on a new model. A human should always sanity-check the outputs of an AI app to see whether it produces the expected results with different prompts, including prompts containing mistakes or unexpected data.
Unfortunately, manual tests can’t be your only testing method once your AI app moves from development to production, because manual testing doesn’t scale well. Done thoroughly, it becomes too time-consuming: a person, or even a team of people, can’t cover the full range of possible inputs and outputs needed to ensure the application’s quality.
Automated tests
In addition to manual testing, production AI applications require a data-driven approach to testing: supplying many inputs, analyzing the many outputs in aggregate, and deciding whether the application is usable or not based on the characteristics of the outputs.
The suitability of a test then depends on the chosen data characteristics. The better you align the test metrics to the behavior you want to cover, the clearer the results of testing will be. The characteristics traditionally picked for model evaluation, such as truthfulness of answers and the ability of an LLM to complete a sentence, might be too broad, and too far removed from what you need to test your application, to be useful.
If you’re building an application that is supposed to answer user questions about building roads, for example, you will care most about the system’s answers in that specific domain, but general-purpose model evaluations rarely cover domain-specific performance. Instead, you need a custom evaluation of the model in your specific domain.
Testing in a data-driven fashion also requires lots of input data. It’s rarely feasible to have this come directly from developers or users. So, in many cases, trustworthy synthetic data needs to be used for input generation at scale. To get hold of it, it’s common to use an LLM to create more variations synthetically, based on a set of manually-created seed data. If your application depends on industry-specific data, such as in our road construction example, generating synthetic input data can make (if adequate) or break (if inadequate) your entire testing paradigm.
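The seed-expansion idea can be sketched as follows. Note the swap: the rule-based rewrites below stand in for LLM-generated variations so that the example stays self-contained, and `SeedExample`, `REWRITES`, and `expandSeeds` are hypothetical names. The point it illustrates is that every synthetic input inherits the known-good result of the seed it was derived from.

```typescript
interface SeedExample {
  input: string;
  result: string;
}

// Rule-based rewrites standing in for LLM-generated paraphrases (an assumption).
const REWRITES: Array<(input: string) => string> = [
  (s) => s.toUpperCase(),
  (s) => `notes of ${s}`,
  (s) => s.split(", ").reverse().join(", "),
];

// Expand each seed into several variations that keep the seed's expected result.
function expandSeeds(seeds: SeedExample[]): SeedExample[] {
  const expanded: SeedExample[] = [];
  for (const seed of seeds) {
    expanded.push(seed);
    for (const rewrite of REWRITES) {
      const input = rewrite(seed.input);
      // Skip rewrites that leave the input unchanged.
      if (input !== seed.input) {
        expanded.push({ input, result: seed.result });
      }
    }
  }
  return expanded;
}

const seeds: SeedExample[] = [
  { input: "lemon zest", result: "Washed" },
  { input: "spicy, black pepper", result: "Natural" },
];

const scenarios = expandSeeds(seeds);
console.log(scenarios.length); // 7
console.log(scenarios.map((s) => `${s.input} => ${s.result}`).join("\n"));
```

Whether the generated inputs are realistic enough for your domain is exactly the adequacy question raised above, so synthetic data produced this way still deserves a manual spot check.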
Besides selecting the criteria that you’ll test against, you might also need to be intentional about what “correct” looks like. You will rarely have 100%-strict criteria, such as “the model should always return a correct answer”, because, as we explained before, testing a model is probabilistic rather than deterministic. Perhaps being correct in 80% of cases is good enough for your use case, or maybe you need to aim for 95% — and the architectures required to support these correctness criteria may be quite different.
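As a rough illustration of a probabilistic pass criterion, the sketch below scores a batch of results and compares the pass rate to a chosen minimum. `CaseResult`, `passRate`, and `meetsThreshold` are hypothetical names, and the sample data is made up for the example.

```typescript
interface CaseResult {
  input: string;
  expected: string;
  actual: string;
}

// Fraction of cases where the model's answer matched the expected label.
function passRate(results: CaseResult[]): number {
  if (results.length === 0) return 0;
  const correct = results.filter((r) => r.actual === r.expected).length;
  return correct / results.length;
}

// The test passes when the aggregate pass rate reaches the chosen minimum.
function meetsThreshold(results: CaseResult[], minPassRate: number): boolean {
  return passRate(results) >= minPassRate;
}

// Made-up sample: four of the five answers are correct.
const results: CaseResult[] = [
  { input: "lemon zest", expected: "Washed", actual: "Washed" },
  { input: "jasmine", expected: "Washed", actual: "Washed" },
  { input: "black pepper", expected: "Natural", actual: "Natural" },
  { input: "caramel", expected: "Honey", actual: "Honey" },
  { input: "blueberry", expected: "Natural", actual: "Washed" },
];

console.log(passRate(results)); // 0.8
console.log(meetsThreshold(results, 0.8)); // true
console.log(meetsThreshold(results, 0.95)); // false
```

The same batch passes an 80% bar and fails a 95% bar, which is why the threshold needs to be decided deliberately for each use case.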
Finally, the whole point of testing is not only to understand if your application is working correctly or not; it’s to provide a feedback loop to the development team with indications on what they can improve or fix to address a specific kind of behavior in their application.
How to test an AI application using Okareo
Let’s look at how to automatically test an AI application using the methodology we described. Using Okareo, you will create an automated test that generates a set of possible inputs for an app, runs those inputs through the app, and verifies the outputs automatically.
This tutorial focuses on an application that uses an LLM to deduce a coffee’s processing method from the tasting notes on the final package. For example, if the tasting notes on a coffee package say “peach, jasmine, black tea”, the user would enter that information into a simple form in the app and get the response “Most likely processing method: Washed or Anaerobic”. In this case the washed (or wet) processing method is the most likely to make the coffee taste this way, but an anaerobically processed coffee could also taste like this.
Start by creating a simple app with TypeScript, Next.js, and React. The core API function reaches out to the OpenAI API and supplies a prompt designed to get the kind of response you require:
import OpenAI from "openai";
const openai = new OpenAI();
export const SYSTEM_PROMPT = `You are a coffee Q grader with lots of experience in the speciality coffee field.`
export function user_prompt(tasteNotes: string) : string {
    return `I have brewed some coffee, and the notes I taste are:${tasteNotes}.
Please respond with the words "most likely processing method" followed
by the most likely processing method used for this coffee. Be brief in
the response, just mention the processing method name and no extra
information. If multiple methods are likely, then mention the most
likely options, but omit any extra information.`
}
async function tasteQuery(tasteNotes: string) : Promise<string> {
    const { data: completion, response } = await openai.chat.completions.create({
        model: 'gpt-4o',
        messages: [
            { role: 'system', content: SYSTEM_PROMPT },
            { role
We then create a small front end using React to enable the user to submit a form, and have the form’s contents be added to the prompt. (This approach has security limitations, but we’re ignoring them at this stage.)
You can find the complete code example that you can run in our GitHub repo.
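The listing in this article stops partway through the query function. As a rough sketch only, and not the code from the repo, here is one way such a function could be shaped so that its prompt assembly can be tested without calling a real model. The `ChatClient` interface, the `RecordingClient` class, and the shortened prompt text are all assumptions made for this illustration; the real OpenAI client exposes a richer API.

```typescript
// Hypothetical minimal chat client; an adapter around a real SDK could implement it.
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

interface ChatClient {
  complete(model: string, messages: ChatMessage[]): Promise<string>;
}

const SYSTEM_PROMPT =
  "You are a coffee Q grader with lots of experience in the speciality coffee field.";

// Shortened stand-in for the article's user prompt template.
function userPrompt(tasteNotes: string): string {
  return `I have brewed some coffee, and the notes I taste are: ${tasteNotes}.`;
}

// The client is passed in, so tests can substitute a fake.
async function tasteQuery(client: ChatClient, tasteNotes: string): Promise<string> {
  const notes = tasteNotes.trim();
  if (notes.length === 0) {
    throw new Error("tasteNotes must not be empty");
  }
  return client.complete("gpt-4o", [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: userPrompt(notes) },
  ]);
}

// Fake client: records the messages it receives and returns a canned answer.
class RecordingClient implements ChatClient {
  calls: ChatMessage[][] = [];
  private cannedAnswer: string;

  constructor(cannedAnswer: string) {
    this.cannedAnswer = cannedAnswer;
  }

  async complete(_model: string, messages: ChatMessage[]): Promise<string> {
    this.calls.push(messages);
    return this.cannedAnswer;
  }
}

const fake = new RecordingClient("Most likely processing method: Washed");
tasteQuery(fake, "  peach, jasmine ").then((answer) => {
  console.log(answer); // Most likely processing method: Washed
  console.log(fake.calls[0][1].content);
});
```

Injecting the client keeps the deterministic part of the function (input cleanup and message construction) separate from the non-deterministic model call, which is the same split the tests below rely on.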
Setting up an initial test
To test this application, we’ll use Jest. We start with a simple deterministic unit test that covers our prompt generation function (without the AI functionality yet):
import generate, { SYSTEM_PROMPT, user_prompt } from "../pages/api/generate";
import { Okareo, OpenAIModel } from "okareo-ts-sdk";
const OKAREO_API_KEY = process.env.OKAREO_API_KEY || "";
const OPENAI_API_KEY = process.env.OPENAI_API_KEY || "";
describe('Prompt concatenation function works', () => {
    it('Should return a prompt that includes our prompt template text', () => {
        const tasteNotes = "peach, jasmine";
        expect(user_prompt(tasteNotes)).toContain("I have brewed some coffee");
    });
});
We then run this test using Jest:
$ npm test
> openai-okareo-coffee-app@0.1.0 test
> jest
PASS test/tastenotes.test.tsx
Prompt concatenation function works
✓ Should return a prompt that includes our prompt template text (1 ms)
Test Suites: 1 passed, 1 total
Tests: 1 passed, 1 total
Snapshots: 0 total
Time: 1
Adding a behavioral test
Now we’ll test the behavior of the AI part of our application. The core value that our application provides to the user is, of course, accurate answers, so we’ll test the accuracy of its answers against a sample set with known-good answers that we provide ahead of time.
To accomplish this, we add a test using Okareo. We start by defining the test header and noting the Okareo project ID so that all results from testing this project end up in the same place in the Okareo dashboard:
describe('Answer generation', () => {
    it('Sshould return a reasonable answer', async () => {
        const projects: any[] = await okareo.getProjects();
        const project_id = projects.find((p) => p.name === PROJECT_NAME)?.id;
...
Next, we write down some seed data (inputs paired with known-good answers) and then use Okareo’s Scenario Set functionality to augment the seed data with a further nine synthetic example user input phrases. Okareo creates this extra synthetic data so that the tests cover a slightly broader set of inputs and thus provide better testing coverage.
...
const TEST_SEED_DATA = [
    SeedData({ input: "lemon zest", result: "Washed" }),
    SeedData({ input: "Spicy, black pepper", result: "Natural" }),
    SeedData({ input: "caramel", result: "Honey" }),
];
const scenario: any = await okareo.create_scenario_set({
    name: `${SCENARIO_SET_NAME} Scenario Set - ${UNIQUE_BUILD_ID}`,
    project_id: project_id,
    seed_data: TEST_SEED_DATA,
});
...
We now need to register our model with Okareo and point it to our system and user prompt templates:
...
const model = await okareo.register_model({
    name: MODEL_NAME,
    tags: [`Build:${UNIQUE_BUILD_ID}`],
    project_id: project_id,
    models: {
        type: "openai",
        model_id: "gpt-4o",
        temperature: 0.1,
        system_prompt_template: SYSTEM_PROMPT,
        user_prompt_template: USER_PROMPT_TEMPLATE,
    } as OpenAIModel,
    update: true,
});
...
Now we can run the evaluation with all the parameters we included:
...
const eval_run: components["schemas"]["TestRunItem"] = await model.run_test({
    model_api_key: OPENAI_API_KEY,
    name: `${MODEL_NAME} Eval ${UNIQUE_BUILD_ID}`,
    tags: [`Build:${UNIQUE_BUILD_ID}`],
    project_id: project_id,
    scenario: scenario,
    calculate_metrics: true,
    type: TestRunType.MULTI_CLASS_CLASSIFICATION,
} as RunTestProps);
...
The results of the evaluation will be stored in the eval_run variable, and we can now generate a test report based on that variable’s contents:
...
const report_definition = {
    error_max: 8,
    metrics_min: {
        precision: 0.5,
        recall: 0.5,
        f1: 0.5,
        accuracy: 0.5,
    },
};
// Make sure the evaluation actually produced metrics before building the report.
expect(eval_run.model_metrics).toBeDefined();
const reporter = new ClassificationReporter({
    eval_run: eval_run,
    ...report_definition,
});
Finally, we can add an assertion that makes the test pass when the Okareo evaluation passes, and fail when the Okareo evaluation fails:
expect(reporter.pass).toBeTruthy();
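To give a feel for what minimum thresholds on precision, recall, F1, and accuracy mean for a multi-class task, here is a self-contained sketch that computes macro-averaged metrics from expected and predicted labels. This is not Okareo's implementation, and the averaging it uses may differ; `classificationMetrics`, `meetsMinimums`, and the sample labels are hypothetical.

```typescript
interface Metrics {
  accuracy: number;
  precision: number;
  recall: number;
  f1: number;
}

// Divide, treating 0/0 as 0 so that unseen labels do not produce NaN.
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

// Macro averaging: compute each metric per label, then average over labels.
function classificationMetrics(expected: string[], predicted: string[]): Metrics {
  if (expected.length === 0 || expected.length !== predicted.length) {
    throw new Error("expected and predicted must be non-empty and the same length");
  }
  const labels = Array.from(new Set([...expected, ...predicted]));
  let correct = 0;
  expected.forEach((want, i) => {
    if (want === predicted[i]) correct += 1;
  });

  let precisionSum = 0;
  let recallSum = 0;
  let f1Sum = 0;
  for (const label of labels) {
    let tp = 0; // predicted this label and it was right
    let fp = 0; // predicted this label but it was wrong
    let fn = 0; // should have predicted this label but did not
    expected.forEach((want, i) => {
      const got = predicted[i];
      if (got === label && want === label) tp += 1;
      else if (got === label) fp += 1;
      else if (want === label) fn += 1;
    });
    const precision = ratio(tp, tp + fp);
    const recall = ratio(tp, tp + fn);
    precisionSum += precision;
    recallSum += recall;
    f1Sum += ratio(2 * precision * recall, precision + recall);
  }

  return {
    accuracy: correct / expected.length,
    precision: precisionSum / labels.length,
    recall: recallSum / labels.length,
    f1: f1Sum / labels.length,
  };
}

// True only when every metric reaches its minimum.
function meetsMinimums(metrics: Metrics, minimums: Metrics): boolean {
  return (
    metrics.accuracy >= minimums.accuracy &&
    metrics.precision >= minimums.precision &&
    metrics.recall >= minimums.recall &&
    metrics.f1 >= minimums.f1
  );
}

// Made-up labels: one "Washed" coffee is misclassified as "Natural".
const expectedLabels = ["Washed", "Washed", "Natural", "Honey"];
const predictedLabels = ["Washed", "Natural", "Natural", "Honey"];

const metrics = classificationMetrics(expectedLabels, predictedLabels);
console.log(metrics.accuracy); // 0.75
console.log(meetsMinimums(metrics, { accuracy: 0.5, precision: 0.5, recall: 0.5, f1: 0.5 })); // true
```

With minimums of 0.5 across the board, as in the report definition above, a single misclassification out of four still passes; raising the minimums makes the same run fail.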
We can now re-run the test in Jest and get a more complete result:
$ npm run test
> openai-okareo-coffee-app@0.1.0 test
> jest --config ./jest.okareo-config.js
PASS test/tastenotes.test.tsx (24.791 s)
Prompt concatenation function works
✓ Should return a prompt that includes our prompt template text (1 ms)
Answer generation
✓ Sshould return a reasonable answer (24033 ms)
Test Suites: 1 passed, 1 total
Tests: 2 passed, 2 total
Snapshots: 0 total
Time: 24
We can see more detailed results of the evaluation in Okareo’s interface.
Try Okareo for testing your AI application
If you’d like to try testing your AI application with Okareo, sign up free and then follow our documentation to get started.
The full repo with the example above is available to clone and try out.
Testing AI applications is tricky because they’re non-deterministic. Here’s how to use Okareo to test the output of your AI-powered apps.
If you’re building an application that takes advantage of AI and LLMs, it's just as important to thoroughly test your code as it is with any other type of application. However, there are a number of issues that make it trickier to test AI apps than non-AI apps.
When writing traditional software, it’s straightforward to write some code for a piece of functionality and then write a deterministic test to ensure that this functionality works. For an AI application, however, a “test” isn’t such a simple concept. First, LLMs are not deterministic, so if you're writing a test around an LLM, you can’t always expect the same output. This is a problem as standard software tests expect a deterministic output.
Second, it is unlikely that you are using just one model, and the unpredictability when combining multiple models can skyrocket. Third, components such as RAG or agent architecture might influence the outputs significantly and aren’t easy to test. And fourth, reliance on third-party APIs may cause additional uncertainty around end user experience — for example, when the provider updates their model or changes its training data.
While it is not straightforward to test AI applications, there are ways. In this article, we show how you can do it using Okareo, a platform for testing LLM applications. We demonstrate with a simple Next.js AI app (powered by GPT-4o) that helps users identify the processing method of a particular coffee based on the tasting notes written on its packaging.
The key difference between testing non-AI and AI applications
The overall testing paradigm is well established in the software world. Generally, you would have tests structured around a pyramid-like structure with many unit tests (function level), some integration tests, and a few end-to-end tests.
Testing non-AI software: The key point when testing regular software is: you can supply one input, and expect to receive one correct output. You can even simulate an entire end-to-end scenario with negligible differences from one execution to another. The fact that most code produces deterministic output makes it possible to write tests that supply one input to a function or make an API call, and expect a certain response that you can directly compare character by character.
Testing AI software: Unlike regular software, AI apps are harder to test because of their non-deterministic output. With AI testing, a model might have many correct inputs, many correct outputs, and many incorrect outputs that are very difficult to separate from the correct ones. When testing an LLM component, for example, it is no longer possible to have a unit test that compares a single output to what’s expected, character by character, as each time your testing code calls the LLM you will get slightly different results, even when using the exact same prompt.
Key approaches to testing AI applications
Manual testing
The very first steps in testing AI are basically ad-hoc or manual prompt tests where a person writes a prompt, looks at the output, and decides if the results look good. Manual testing is an important step when working on a new model – a human should always sanity-check any outputs of an AI app to see if they produce expected results with different prompts, including prompts containing mistakes or unexpected data.
Unfortunately, manual tests can't be your only testing method once your AI app moves from development to production, as manual testing doesn’t scale well. If you want to do manual testing well, it becomes too time-consuming, where a person or even a team of people isn’t able to cover the full range of possible inputs and outputs to ensure the application’s quality.
Automated tests
In addition to manual testing, production AI applications require a data-driven approach to testing: supplying many inputs, analyzing the many outputs in aggregate, and deciding whether the application is usable or not based on the characteristics of the outputs.
The suitability of a test then depends on the chosen data characteristics. The better you align the test metrics to the behavior you want to cover, the clearer the results of testing will be. The characteristics traditionally picked for model evaluation, such as truthfulness of answers and the ability of an LLM to complete a sentence, might be too broad, and too far removed from what you need to test your application, to be useful.
If you’re building an application that is supposed to answer user questions about building roads, for example, you will likely care about the system’s answers in that specific domain, but most model evaluations rarely cover domain-specific performance. Instead, you need to do custom evaluation of the model in your specific domain.
Testing in a data-driven fashion also requires lots of input data. It’s rarely feasible to have this come directly from developers or users. So, in many cases, trustworthy synthetic data needs to be used for input generation at scale. To get hold of it, it’s common to use an LLM to create more variations synthetically, based on a set of manually-created seed data. If your application depends on industry-specific data, such as in our road construction example, generating synthetic input data can make (if adequate) or break (if inadequate) your entire testing paradigm.
Besides selecting the criteria that you’ll test against, you might also need to be intentional about what “correct” looks like. You will rarely have 100%-strict criteria, such as “the model should always return a correct answer”, because, as we explained before, testing a model is probabilistic rather than deterministic. Perhaps being correct in 80% of cases is good enough for your use case, or maybe you need to aim for 95% — and the architectures required to support these correctness criteria may be quite different.
Finally, the whole point of testing is not only to understand if your application is working correctly or not; it’s to provide a feedback loop to the development team with indications on what they can improve or fix to address a specific kind of behavior in their application.
How to test an AI application using Okareo
Let’s look at how to automatically test an AI application using the methodology we described. Using Okareo, you will create an automated test that generates a set of possible inputs for an app, and then run those inputs through the app and verify the outputs automatically.
We decided to focus this tutorial on building an application that uses an LLM to deduce a coffee’s processing method from the tasting notes on the final package. For example, if the taste notes written on a coffee package say “peach, jasmine, black tea”, the user would input that information into a simple form in the app and get the response “Most likely processing method: Washed or Anaerobic” — as in this case the washed, or wet, coffee processing method is most likely to make the coffee taste this way, but it would also be possible to have anaerobically processed coffee to taste this way.
Start by creating a simple app with TypeScript, Next.js, and React. The core API function reaches out to the OpenAI API and supplies a prompt designed to get the kind of response you require:
import OpenAI from "openai";
const openai = new OpenAI();
export const SYSTEM_PROMPT = `You are a coffee Q grader with lots of experience in the speciality coffee field.`
export function user_prompt(tasteNotes: string) : string {
return `I have brewed some coffee, and the notes I taste are:${tasteNotes}.
Please respond with the words "most likely processing method" followed
by the most likely processing method used for this coffee. Be brief in
the response, just mention the processing method name and no extra
information. If multiple methods are likely, then mention the most
likely options, but omit any extra information.`
}
async function tasteQuery(tasteNotes: string) : Promise<string> {
const { data: completion, response } = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
{ role: 'system', content: SYSTEM_PROMPT },
{ role
We then create a small front end using React to enable the user to submit a form, and have the form’s contents be added to the prompt. (This approach has security limitations, but we’re ignoring them at this stage.)
You can find the complete code example that you can run in our GitHub repo.
Setting up an initial test
For testing this application, we will be using Jest, and we’ll start by adding a simple deterministic unit test, to cover our prompt generation function (without including the AI functionality yet):
import generate, { SYSTEM_PROMPT, user_prompt } from "../pages/api/generate";
import { Okareo, OpenAIModel } from "okareo-ts-sdk"
const OKAREO_API_KEY = process.env.OKAREO_API_KEY || "";
const OPENAI_API_KEY = process.env.OPENAI_API_KEY || "";
describe('Prompt concatenation function works', () => {
it('Should return a prompt that includes our prompt template text', () => {
let tasteNotes = "peach, jasmine";
expect(user_prompt(tasteNotes)).toContain("I have brewed some coffee")
})
})
We then run this test using Jest:
$ npm test
> openai-okareo-coffee-app@0.1.0 test
> jest
PASS test/tastenotes.test.tsx
Prompt concatenation function works
✓ Should return a prompt that includes our prompt template text (1 ms)
Test Suites: 1 passed, 1 total
Tests: 1 passed, 1 total
Snapshots: 0 total
Time: 1
Adding a behavioral test
Now we’ll test the behavior of the AI part of our application. The core value that our application provides to the user, of course, is accurate answers — so we’ll test the accuracy of answers using for a sample set with known-good answers that we’ll provide ahead of time.
To accomplish this, we add a test using Okareo. We start by defining the test header and noting the Okareo project ID so that all results from testing this project end up in the same place in the Okareo dashboard:
describe('Answer generation', () => {
it('Sshould return a reasonable answer', async () => {
const projects: any[] = await okareo.getProjects();
const project_id = projects.find((p) => p.name === PROJECT_NAME)?.id;
...
Next, we supply write down some seed data (known-good answers) and then use Okareo’s Scenario Set functionality to augment the seed data with a further nine9 synthetic example user input phrases. The reason why Okareo creates more synthetic data is to make the tests cover a slightly broader set of inputs and thus provide better testing coverage.
...
const TEST_SEED_DATA = [
SeedData({ input: "lemon zest", result: "Washed" }),
SeedData({ input: "Spicy, black pepper", result: "Natural" }),
SeedData({ input: "caramel", result: "Honey" }),
];
const scenario: any = await okareo.create_scenario_set({
name: `${SCENARIO_SET_NAME} Scenario Set - ${UNIQUE_BUILD_ID}`,
project_id: project_id,
seed_data: TEST_SEED_DATA,
});
...
We now need to register our model with Okareo and point it to our system and user prompt templates:
...
const model = await okareo.register_model({
name: MODEL_NAME,
tags: [`Build:${UNIQUE_BUILD_ID}`],
project_id: project_id,
models: {
type: "openai",
model_id: "gpt-4o",
temperature: 0.1,
system_prompt_template: SYSTEM_PROMPT,
user_prompt_template: USER_PROMPT_TEMPLATE,
} as OpenAIModel,
update: true,
});
...
Now we can run the evaluation with all the parameters we included:
...
const eval_run: components["schemas"]["TestRunItem"] = await model.run_test({
model_api_key: OPENAI_API_KEY,
name: `${MODEL_NAME} Eval ${UNIQUE_BUILD_ID}`,
tags: [`Build:${UNIQUE_BUILD_ID}`],
project_id: project_id,
scenario: scenario,
calculate_metrics: true,
type: TestRunType.MULTI_CLASS_CLASSIFICATION,
} as RunTestProps
);
...
The results of the evaluation will be stored in the eval_run
variable, and we can now generate a test report based on that variable’s contents:
...
const report_definition = {
error_max: 8,
metrics_min: {
precision: 0.5,
recall: 0.5,
f1: 0.5,
accuracy: 0.5,
},
};
expect(eval_run.model_metrics)
const reporter = new ClassificationReporter({
eval_run:eval_run, });
...report_definition,
});
Finally, we can add an assertion that makes the test pass when the Okareo evaluation passes, and fail when the Okareo evaluation fails:
await expect(reporter.pass).toBeTruthy;
We can now re-run the test in Jest and get a more complete result:
$ npm run test
> openai-okareo-coffee-app@0.1.0 test
> jest --config ./jest.okareo-config.js
PASS test/tastenotes.test.tsx (24.791 s)
Prompt concatenation function works
✓ Should return a prompt that includes our prompt template text (1 ms)
Answer generation
✓ Sshould return a reasonable answer (24033 ms)
Test Suites: 1 passed, 1 total
Tests: 2 passed, 2 total
Snapshots: 0 total
Time: 24
We can see the more detailed results of the evaluation in Okareo’s interface:
Try Okareo for testing your AI application
If you’d like to try testing your AI application with Okareo, sign up free and then follow our documentation to get started.
You can find the full repo with the example above that you can clone and try out here.
Testing AI applications is tricky because they’re non-deterministic. Here’s how to use Okareo to test the output of your AI-powered apps.
If you’re building an application that takes advantage of AI and LLMs, it's just as important to thoroughly test your code as it is with any other type of application. However, there are a number of issues that make it trickier to test AI apps than non-AI apps.
When writing traditional software, it’s straightforward to write some code for a piece of functionality and then write a deterministic test to ensure that this functionality works. For an AI application, however, a “test” isn’t such a simple concept. First, LLMs are not deterministic, so if you're writing a test around an LLM, you can’t always expect the same output. This is a problem as standard software tests expect a deterministic output.
Second, it is unlikely that you are using just one model, and the unpredictability when combining multiple models can skyrocket. Third, components such as RAG or agent architecture might influence the outputs significantly and aren’t easy to test. And fourth, reliance on third-party APIs may cause additional uncertainty around end user experience — for example, when the provider updates their model or changes its training data.
While it is not straightforward to test AI applications, there are ways. In this article, we show how you can do it using Okareo, a platform for testing LLM applications. We demonstrate with a simple Next.js AI app (powered by GPT-4o) that helps users identify the processing method of a particular coffee based on the tasting notes written on its packaging.
The key difference between testing non-AI and AI applications
The overall testing paradigm is well established in the software world. Generally, you would have tests structured around a pyramid-like structure with many unit tests (function level), some integration tests, and a few end-to-end tests.
Testing non-AI software: The key point when testing regular software is: you can supply one input, and expect to receive one correct output. You can even simulate an entire end-to-end scenario with negligible differences from one execution to another. The fact that most code produces deterministic output makes it possible to write tests that supply one input to a function or make an API call, and expect a certain response that you can directly compare character by character.
Testing AI software: Unlike regular software, AI apps are harder to test because of their non-deterministic output. With AI testing, a model might have many correct inputs, many correct outputs, and many incorrect outputs that are very difficult to separate from the correct ones. When testing an LLM component, for example, it is no longer possible to have a unit test that compares a single output to what’s expected, character by character, as each time your testing code calls the LLM you will get slightly different results, even when using the exact same prompt.
Key approaches to testing AI applications
Manual testing
The very first steps in testing AI are basically ad-hoc or manual prompt tests where a person writes a prompt, looks at the output, and decides if the results look good. Manual testing is an important step when working on a new model – a human should always sanity-check any outputs of an AI app to see if they produce expected results with different prompts, including prompts containing mistakes or unexpected data.
Unfortunately, manual tests can't be your only testing method once your AI app moves from development to production, as manual testing doesn’t scale well. If you want to do manual testing well, it becomes too time-consuming, where a person or even a team of people isn’t able to cover the full range of possible inputs and outputs to ensure the application’s quality.
Automated tests
In addition to manual testing, production AI applications require a data-driven approach to testing: supplying many inputs, analyzing the many outputs in aggregate, and deciding whether the application is usable or not based on the characteristics of the outputs.
The suitability of a test then depends on the chosen data characteristics. The better you align the test metrics to the behavior you want to cover, the clearer the results of testing will be. The characteristics traditionally picked for model evaluation, such as truthfulness of answers and the ability of an LLM to complete a sentence, might be too broad, and too far removed from what you need to test your application, to be useful.
If you’re building an application that is supposed to answer user questions about building roads, for example, you will likely care about the system’s answers in that specific domain, but most model evaluations rarely cover domain-specific performance. Instead, you need to do custom evaluation of the model in your specific domain.
Testing in a data-driven fashion also requires lots of input data. It’s rarely feasible to have this come directly from developers or users. So, in many cases, trustworthy synthetic data needs to be used for input generation at scale. To get hold of it, it’s common to use an LLM to create more variations synthetically, based on a set of manually-created seed data. If your application depends on industry-specific data, such as in our road construction example, generating synthetic input data can make (if adequate) or break (if inadequate) your entire testing paradigm.
Besides selecting the criteria that you’ll test against, you might also need to be intentional about what “correct” looks like. You will rarely have 100%-strict criteria, such as “the model should always return a correct answer”, because, as we explained before, testing a model is probabilistic rather than deterministic. Perhaps being correct in 80% of cases is good enough for your use case, or maybe you need to aim for 95% — and the architectures required to support these correctness criteria may be quite different.
Finally, the whole point of testing is not only to understand if your application is working correctly or not; it’s to provide a feedback loop to the development team with indications on what they can improve or fix to address a specific kind of behavior in their application.
How to test an AI application using Okareo
Let’s look at how to automatically test an AI application using the methodology we described. Using Okareo, you will create an automated test that generates a set of possible inputs for an app, and then run those inputs through the app and verify the outputs automatically.
We decided to focus this tutorial on an application that uses an LLM to deduce a coffee’s processing method from the tasting notes on the final package. For example, if the tasting notes on a coffee package say “peach, jasmine, black tea”, the user would enter them into a simple form in the app and get the response “Most likely processing method: Washed or Anaerobic”. In this case the washed (or wet) processing method is the most likely to produce these flavors, but anaerobically processed coffee could also taste this way.
Start by creating a simple app with TypeScript, Next.js, and React. The core API function reaches out to the OpenAI API and supplies a prompt designed to get the kind of response you require:
import OpenAI from "openai";
// The client reads OPENAI_API_KEY from the environment by default.
const openai = new OpenAI();
export const SYSTEM_PROMPT = `You are a coffee Q grader with lots of experience in the speciality coffee field.`;
// Wrap the user's tasting notes in the instructions that shape the reply.
export function user_prompt(tasteNotes: string): string {
  return `I have brewed some coffee, and the notes I taste are:${tasteNotes}.
Please respond with the words "most likely processing method" followed
by the most likely processing method used for this coffee. Be brief in
the response, just mention the processing method name and no extra
information. If multiple methods are likely, then mention the most
likely options, but omit any extra information.`;
}
async function tasteQuery(tasteNotes: string): Promise<string> {
  const { data: completion, response } = await openai.chat.completions
    .create({
      model: 'gpt-4o',
      messages: [
        { role: 'system', content: SYSTEM_PROMPT },
        // The original listing is cut off at this point. Everything from here
        // to the end of the function is an assumed completion.
        { role: 'user', content: user_prompt(tasteNotes) },
      ],
    })
    .withResponse();
  return completion.choices[0].message.content ?? "";
}
We then create a small front end using React that lets the user submit a form, and we add the form’s contents to the prompt. (Passing user input straight into a prompt has security limitations, such as prompt injection, but we’re ignoring them at this stage.)
You can find the complete, runnable code example in our GitHub repo.
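Because the prompt asks for a fixed phrase followed by one or more method names, the reply can be turned into a list of labels before it is compared with a known-good answer. The sketch below is one way to do that; `parseProcessingMethods` is a hypothetical helper that is not part of the example app, and it assumes the model follows the requested format.

```typescript
// Assumed reply shape, taken from the prompt's instructions:
//   "Most likely processing method: Washed or Anaerobic"
const ANSWER_PREFIX = /^\s*most likely processing method\s*:?\s*/i;

// Split a reply into a list of lower-cased method names.
function parseProcessingMethods(reply: string): string[] {
  return reply
    .replace(ANSWER_PREFIX, "") // drop the fixed lead-in phrase
    .replace(/\.\s*$/, "") // drop a trailing period, if any
    .split(/\s*(?:,|\/|\bor\b|\band\b)\s*/i) // split on commas, slashes, "or", "and"
    .map((method) => method.trim().toLowerCase())
    .filter((method) => method.length > 0);
}

console.log(
  parseProcessingMethods("Most likely processing method: Washed or Anaerobic")
);
// [ 'washed', 'anaerobic' ]
```

Normalizing replies like this keeps later comparisons from failing on harmless differences in casing or punctuation.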
Setting up an initial test
For testing this application, we will be using Jest, and we’ll start by adding a simple deterministic unit test, to cover our prompt generation function (without including the AI functionality yet):
import generate, { SYSTEM_PROMPT, user_prompt } from "../pages/api/generate";
import { Okareo, OpenAIModel } from "okareo-ts-sdk";
const OKAREO_API_KEY = process.env.OKAREO_API_KEY || "";
const OPENAI_API_KEY = process.env.OPENAI_API_KEY || "";
// Deterministic unit test: no LLM call, only the prompt template.
describe('Prompt concatenation function works', () => {
  it('Should return a prompt that includes our prompt template text', () => {
    const tasteNotes = "peach, jasmine";
    expect(user_prompt(tasteNotes)).toContain("I have brewed some coffee");
  });
});
We then run this test using Jest:
$ npm test
> openai-okareo-coffee-app@0.1.0 test
> jest
PASS test/tastenotes.test.tsx
Prompt concatenation function works
✓ Should return a prompt that includes our prompt template text (1 ms)
Test Suites: 1 passed, 1 total
Tests: 1 passed, 1 total
Snapshots: 0 total
Time: 1
Adding a behavioral test
Now we’ll test the behavior of the AI part of our application. The core value that our application provides to the user is, of course, accurate answers, so we’ll test the accuracy of answers using a sample set with known-good answers that we’ll provide ahead of time.
To accomplish this, we add a test using Okareo. We start by defining the test header and noting the Okareo project ID so that all results from testing this project end up in the same place in the Okareo dashboard:
describe('Answer generation', () => {
  it('Should return a reasonable answer', async () => {
    // Look up the project so that all results land in the same place in the dashboard.
    const projects: any[] = await okareo.getProjects();
    const project_id = projects.find((p) => p.name === PROJECT_NAME)?.id;
    ...
Next, we write down some seed data (inputs with known-good answers) and then use Okareo’s Scenario Set functionality to augment the seed data with a further nine synthetic example user input phrases. Okareo creates this additional synthetic data so that the tests cover a slightly broader set of inputs and thus provide better test coverage.
...
// Known-good input/answer pairs used as seed data.
const TEST_SEED_DATA = [
  SeedData({ input: "lemon zest", result: "Washed" }),
  SeedData({ input: "Spicy, black pepper", result: "Natural" }),
  SeedData({ input: "caramel", result: "Honey" }),
];
const scenario: any = await okareo.create_scenario_set({
  name: `${SCENARIO_SET_NAME} Scenario Set - ${UNIQUE_BUILD_ID}`,
  project_id: project_id,
  seed_data: TEST_SEED_DATA,
});
...
We now need to register our model with Okareo and point it to our system and user prompt templates:
...
const model = await okareo.register_model({
  name: MODEL_NAME,
  tags: [`Build:${UNIQUE_BUILD_ID}`],
  project_id: project_id,
  models: {
    type: "openai",
    model_id: "gpt-4o",
    temperature: 0.1,
    system_prompt_template: SYSTEM_PROMPT,
    user_prompt_template: USER_PROMPT_TEMPLATE,
  } as OpenAIModel,
  update: true,
});
...
Now we can run the evaluation with all the parameters we included:
...
const eval_run: components["schemas"]["TestRunItem"] = await model.run_test({
  model_api_key: OPENAI_API_KEY,
  name: `${MODEL_NAME} Eval ${UNIQUE_BUILD_ID}`,
  tags: [`Build:${UNIQUE_BUILD_ID}`],
  project_id: project_id,
  scenario: scenario,
  calculate_metrics: true,
  type: TestRunType.MULTI_CLASS_CLASSIFICATION,
} as RunTestProps);
...
The results of the evaluation will be stored in the eval_run variable, and we can now generate a test report based on that variable’s contents:
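...
// Thresholds the evaluation has to meet for the report to pass.
const report_definition = {
  error_max: 8,
  metrics_min: {
    precision: 0.5,
    recall: 0.5,
    f1: 0.5,
    accuracy: 0.5,
  },
};
// Make sure the run actually produced metrics before building the report.
expect(eval_run.model_metrics).toBeDefined();
const reporter = new ClassificationReporter({
  eval_run: eval_run,
  ...report_definition,
});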
Finally, we can add an assertion that makes the test pass when the Okareo evaluation passes, and fail when the Okareo evaluation fails:
expect(reporter.pass).toBeTruthy();
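To give a feel for what thresholds such as the ones in the report definition mean, the sketch below computes accuracy and macro-averaged precision, recall, and F1 from expected and actual labels, then checks them against minimum values. This is only an illustration: all names in it are hypothetical, macro averaging is our assumption, and Okareo’s own calculation may differ.

```typescript
interface LabeledResult {
  expected: string;
  actual: string;
}

interface Metrics {
  precision: number;
  recall: number;
  f1: number;
  accuracy: number;
}

// Accuracy plus macro-averaged precision, recall, and F1 over all labels seen.
function classificationMetrics(results: LabeledResult[]): Metrics {
  const labels: string[] = [];
  let correct = 0;
  for (const r of results) {
    if (r.expected === r.actual) correct++;
    for (const label of [r.expected, r.actual]) {
      if (labels.indexOf(label) === -1) labels.push(label);
    }
  }
  let precisionSum = 0;
  let recallSum = 0;
  let f1Sum = 0;
  for (const label of labels) {
    const truePositives = results.filter(
      (r) => r.expected === label && r.actual === label
    ).length;
    const predictedCount = results.filter((r) => r.actual === label).length;
    const actualCount = results.filter((r) => r.expected === label).length;
    const p = predictedCount === 0 ? 0 : truePositives / predictedCount;
    const rc = actualCount === 0 ? 0 : truePositives / actualCount;
    precisionSum += p;
    recallSum += rc;
    f1Sum += p + rc === 0 ? 0 : (2 * p * rc) / (p + rc);
  }
  const labelCount = Math.max(labels.length, 1);
  return {
    precision: precisionSum / labelCount,
    recall: recallSum / labelCount,
    f1: f1Sum / labelCount,
    accuracy: results.length === 0 ? 0 : correct / results.length,
  };
}

// True when every metric listed in `minimums` is reached.
function meetsMinimums(metrics: Metrics, minimums: Partial<Metrics>): boolean {
  return (Object.keys(minimums) as (keyof Metrics)[]).every(
    (key) => metrics[key] >= (minimums[key] ?? 0)
  );
}

const sampleRun: LabeledResult[] = [
  { expected: "Washed", actual: "Washed" },
  { expected: "Natural", actual: "Natural" },
  { expected: "Honey", actual: "Natural" },
  { expected: "Washed", actual: "Washed" },
];
const runMetrics = classificationMetrics(sampleRun);
console.log(runMetrics.accuracy); // 0.75
console.log(
  meetsMinimums(runMetrics, { precision: 0.5, recall: 0.5, f1: 0.5, accuracy: 0.5 })
); // true
```

A run with one wrong answer out of four still clears minimums of 0.5, which shows how lenient these thresholds are; raising them tightens the test.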
We can now re-run the test in Jest and get a more complete result:
$ npm run test
> openai-okareo-coffee-app@0.1.0 test
> jest --config ./jest.okareo-config.js
PASS test/tastenotes.test.tsx (24.791 s)
Prompt concatenation function works
✓ Should return a prompt that includes our prompt template text (1 ms)
Answer generation
✓ Should return a reasonable answer (24033 ms)
Test Suites: 1 passed, 1 total
Tests: 2 passed, 2 total
Snapshots: 0 total
Time: 24
We can see more detailed results of the evaluation in Okareo’s interface.
Try Okareo for testing your AI application
If you’d like to try testing your AI application with Okareo, sign up for free and then follow our documentation to get started.
The full repo with the example above is available to clone and try out.