LLM Evaluation Metrics

Evaluation

Matt Wyman, CEO/Co-Founder

Rachael Churchill, Technical Content Writer

November 22, 2024

To verify that your LLMs or LLM-based apps are performing correctly, you need objective evaluation metrics. This article explores the kinds of LLM evaluation metrics that exist and which ones are right for your use case. It also explains how to use Okareo to evaluate your LLM according to these metrics and others of your choice.

Why you need LLM evaluation metrics

As with any other app, you need to evaluate LLM-based apps to ensure they're achieving the purpose for which you implemented them, and to make sure the model’s output doesn't get worse when you change something (for example, replacing the model, retraining it, or changing your system prompt). With LLMs in particular, you also need to be on guard against model drift, where the model’s output gets worse over time even though you haven't changed anything.

You also need metrics that are objective and quantitative, so that you can:

  • evaluate your LLM app in a consistent and unbiased way.

  • measure changes over time.

  • correlate changes (such as improved or worsened output) with changes you make (for example, to your prompts).

  • compare one model with another.

Finally, the metrics need to be easily automatable so you can schedule regular, frequent checks and see how things change over time, and so that you can integrate the metrics into your existing testing infrastructure, including CI/CD. 

Types of LLM evaluation metrics

You can divide LLM evaluation metrics into reference-based and reference-free metrics. Reference-based metrics compare your LLM’s output against a “gold standard,” or example of expected output. For example, BLEU score (which is commonly used in translation tasks) measures the overlap of n-grams between the text generated by an LLM and some reference text. By contrast, reference-free metrics evaluate the output in isolation. For example, measuring consistency can tell you how well your model generates stable, reliable responses across similar inputs without needing an expected output to compare against.
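To make the reference-based idea concrete, below is a minimal TypeScript sketch of n-gram precision, the kind of overlap that BLEU is built from. It is illustrative only, not the official BLEU implementation (which combines several n-gram sizes and applies a brevity penalty).

// Illustrative only: the fraction of candidate n-grams that also appear in the
// reference text, clipped so repeated n-grams aren't over-credited.
function ngramPrecision(candidate: string, reference: string, n = 2): number {
    const ngrams = (text: string): string[] => {
        const tokens = text.toLowerCase().split(/\s+/).filter(Boolean);
        const grams: string[] = [];
        for (let i = 0; i + n <= tokens.length; i++) {
            grams.push(tokens.slice(i, i + n).join(" "));
        }
        return grams;
    };

    const candidateGrams = ngrams(candidate);
    if (candidateGrams.length === 0) return 0;

    // Count reference n-grams so each one can only be matched once.
    const referenceCounts = new Map<string, number>();
    for (const gram of ngrams(reference)) {
        referenceCounts.set(gram, (referenceCounts.get(gram) ?? 0) + 1);
    }

    let matches = 0;
    for (const gram of candidateGrams) {
        const remaining = referenceCounts.get(gram) ?? 0;
        if (remaining > 0) {
            matches++;
            referenceCounts.set(gram, remaining - 1);
        }
    }
    return matches / candidateGrams.length;
}

// Compare a generated summary against a gold-standard reference (score between 0 and 1).
console.log(ngramPrecision("the cat sat on the mat", "the cat is on the mat"));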

Separately, you can also divide them into deterministic and non-deterministic metrics. 

Deterministic metrics, such as word count or character count, are straightforward and clear-cut: if you measure the same output multiple times, the result will always be the same. They also include structural validation metrics for code generation, which check that the output is in a specific format: for example, that it conforms to the JSON specification, or that the JSON contains certain required terms. 
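Deterministic checks like these often take only a handful of lines of code. The TypeScript sketch below (with key names chosen purely for illustration) measures word and character counts and validates that an output is JSON containing some required keys.

// Deterministic checks: the same output always produces the same result.
const wordCount = (output: string): number =>
    output.split(/\s+/).filter(Boolean).length;

const charCount = (output: string): number => output.length;

// Structural validation: is the output valid JSON, and does it contain all the
// required keys? (The key names passed in below are just an example.)
function isJsonWithKeys(output: string, requiredKeys: string[]): boolean {
    try {
        const parsed = JSON.parse(output);
        return requiredKeys.every((key) => key in parsed);
    } catch {
        return false; // not valid JSON (or not an object)
    }
}

console.log(wordCount("Summarize the meeting in one sentence."));              // 6
console.log(charCount("Summarize the meeting in one sentence."));              // 38
console.log(isJsonWithKeys('{"summary": "Quarterly review."}', ["summary"]));  // true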

Non-deterministic metrics, such as the coherence or friendliness of a piece of generated text, are usually more subjective. These previously required a human judge, but now they can be judged by another LLM.

A table showing the differences between reference-based and reference-free metrics, and deterministic and non-deterministic metrics.

There are also metrics that measure efficiency or performance: for example, latency or inference time, or usage of memory or other resources. These are external to the output, unlike the metrics above, which all measure properties of the output.
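Because these metrics are external to the output, you can capture them with ordinary timing code around your model calls. Below is a minimal sketch; callModel is a placeholder for whatever function actually invokes your LLM or LLM-based app.

// Measure the latency (in milliseconds) of a single model call.
async function measureLatencyMs(callModel: () => Promise<string>): Promise<number> {
    const start = Date.now();
    await callModel();
    return Date.now() - start;
}

// Average over several runs to smooth out variance between calls.
async function averageLatencyMs(callModel: () => Promise<string>, runs = 5): Promise<number> {
    let total = 0;
    for (let i = 0; i < runs; i++) {
        total += await measureLatencyMs(callModel);
    }
    return total / runs;
}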

Which LLM evaluation metrics should you use?

The evaluation metrics you use will depend on your use case and what you want to measure.

Reference-based vs. reference-free metrics

Reference-free metrics are often more versatile than reference-based ones, because reference-based metrics need a “gold standard,” or ground truth, to compare against, and these need to come from somewhere. This need increases overheads and limits the volume of testing you can do. It also limits the kinds of things you can test if you’re only testing similarity (in whatever sense) to a gold standard document. However, for some use cases, this may be what you want. For example, if you’re building an app for translation or summarization, you will want to check that your output is sufficiently similar in meaning to the original.

With reference-free metrics, you can evaluate your output in isolation. There’s no gold standard document needed, so you don't need a human in the loop to create it or evaluate its suitability. Reference-free metrics also increase the range of things you can test: for example, friendliness, politeness, or fluency, which are properties of the output on its own, not in reference to another document. 

Reference-free metrics also allow for testing at much higher volumes, through strategies like creating an initial dataset of test cases and using it as a seed to generate large numbers of similar test cases.

Deterministic vs. non-deterministic metrics

Deterministic metrics are good for any property of the output that can be defined programmatically. If the output is plain text, this could be word or character counts. If you’re generating code, you might want to check that it’s valid JSON or syntactically correct Python. You could then check for the presence or number of particular properties in the generated JSON, or of particular function calls or arguments in the generated Python code.

Non-deterministic metrics are good for almost everything else. For example, you could evaluate the output for fluency (correct spelling, punctuation, and grammar) or for coherence (how good the structure and organization are). You could check the relevance of the output to the user’s query, or you could check the consistency between the input and the output, using entailment-based metrics to determine whether the output text entails, contradicts, or undermines the premise, and thus detect inconsistency.
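As a rough illustration of how an LLM judge for one of these properties works, the sketch below asks a judge model to rate coherence on a 1-to-5 scale using the openai npm package. This is not how Okareo implements its built-in checks; the judge model, prompt wording, and scale are all assumptions chosen for illustration.

import OpenAI from "openai";

// Illustrative LLM-as-a-judge check; the model name, prompt, and 1-5 scale are
// assumptions, not Okareo's implementation.
const judge = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function scoreCoherence(output: string): Promise<number> {
    const response = await judge.chat.completions.create({
        model: "gpt-4o-mini", // any capable judge model
        temperature: 0,       // keep the judge as repeatable as possible
        messages: [
            {
                role: "system",
                content: "You are an evaluator. Rate the coherence (structure and organization) of the user's text on a scale of 1 to 5. Reply with the number only.",
            },
            { role: "user", content: output },
        ],
    });
    return Number(response.choices[0].message.content?.trim());
}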

It’s important to use the right tool for the job. Non-deterministic metrics are exciting because they’re more advanced and sophisticated than deterministic ones, but if all you care about is the length of the output or whether it contains certain keywords, you don’t need them, or the extra overhead of the second LLM that’s used to judge the output of the first.

Using LLM evaluation metrics for exploratory data analysis

Although LLM evaluation metrics are useful for CI/CD and for regression testing — to make sure the output of your model isn’t getting worse due to changes you’ve made or over time due to model drift — that's not their only purpose.

You can also use them as an exploratory tool to get a sense of the distribution of your app’s output, particularly during the early stages of building your app. This allows you to understand the behavior of your app better. In particular, it gives you insight into whether small changes in your system prompt or other settings are likely to lead to small or large changes in the output. This means you can gauge its robustness (if you’re broadly happy with its behavior and don’t want it to change much) — or, if you do want it to change, you can gauge how achievable that is.

Continuous metrics are the most effective type for exploratory data analysis; binary pass/fail metrics are not granular enough to convey the distribution of things like text length, verbosity, or conciseness.

Consider an LLM app developer who is using both a “word count” metric and a “below 256 characters?” metric in their test suite. That could seem redundant at first glance, but “below 256 characters?” is a binary pass/fail metric and doesn’t give you a sense of the distribution of the length of the output. If the output meets the character-count threshold half the time, that doesn’t tell you whether the lengths are tightly clustered on either side of the threshold, normally distributed around it, heavily skewed, or spread uniformly all over the place.

Some possible distributions of the length of the output from an LLM app.

It’s useful to know which of these your output resembles. If it’s normal or uniform, a small change to the model or the system prompt probably won’t make much difference to your binary pass/fail metric; but if the distribution is tightly clustered or skewed, a small change could make the difference between mostly passing and mostly failing.

Also, because LLMs work on tokens, they operate at the level of words or parts of words and can’t directly see or control the number of characters in their output. That’s one reason to measure both the word count and the character count: if you sometimes use a system prompt that encourages formal writing and longer words, and other times use one that encourages short, simple words, the word count may be poorly correlated with the character count.
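A small amount of code is enough to see these distributions for yourself. The sketch below takes a batch of outputs, buckets their character counts into a coarse text histogram, and also reports the binary pass rate against a 256-character threshold for comparison.

// Exploratory look at output lengths: a histogram of character counts plus the
// binary pass rate for a 256-character threshold.
function lengthDistribution(outputs: string[], bucketSize = 64): void {
    const histogram = new Map<number, number>();
    let passCount = 0;

    for (const output of outputs) {
        const chars = output.length;
        if (chars < 256) passCount++;
        const bucket = Math.floor(chars / bucketSize) * bucketSize;
        histogram.set(bucket, (histogram.get(bucket) ?? 0) + 1);
    }

    // Print buckets in ascending order as a simple text histogram.
    for (const [bucket, count] of [...histogram.entries()].sort((a, b) => a[0] - b[0])) {
        console.log(`${bucket}-${bucket + bucketSize - 1} chars: ${"#".repeat(count)}`);
    }
    console.log(`Below 256 chars: ${passCount}/${outputs.length}`);
}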

Creating LLM evaluation metrics using Okareo

Okareo tests all the metrics discussed above out of the box, and also lets you define your own custom checks. These can be any type of metric, including ones that use an LLM as a judge, such as friendliness.

The simplest way to experiment is in the Okareo web app. Go to Checks and browse the list of built-in options, which includes examples of different kinds of checks: deterministic ones like is_json or does_code_compile, and non-deterministic ones like coherence or fluency.

You can also create your own checks in Okareo. A check can be either a CodeBasedCheck deterministically implemented in code, or a ModelBasedCheck, where the check is described using a prompt and evaluated by an LLM judge. 

In the web app, you can do this using the Create Check button. You type a description of the check you want, and Okareo uses an LLM to immediately generate the check as Python code, which then runs automatically at evaluation time.

Below is an example of a user-created check that verifies whether the output is valid JSON with the properties short_summary, actions, and attendee_list. Okareo has generated the Python code for deterministically checking this. Once you’ve created a check, you can click on it and a modal will appear showing similar code.

Screenshot of the Python code generated by Okareo for the JSON check.

Once all your checks have been defined in Okareo (either in-app or programmatically), you can run an evaluation with your custom checks using flows (scripts written in Python or TypeScript) and config files. 

Running an LLM evaluation using an Okareo flow

Below is an example of an Okareo flow for running an LLM evaluation that you can follow along with. You'll need to install the Okareo CLI locally to run your flows.

This particular example is in TypeScript, and the full code is available on our GitHub. It uses an LLM judge to apply non-deterministic checks for coherence, consistency, fluency, and relevance to the output of a text summarization app.

The code below shows that there are three main steps to running an LLM evaluation in Okareo:

  1. Create a scenario set: This is an array of inputs, each paired with a result. For reference-based metrics, the result is the gold-standard reference. For reference-free metrics, you can just pass in an empty string.

  2. Register your model with Okareo: This example uses OpenAI's GPT-3.5 Turbo model.

  3. Run the evaluation: Use the run_test method to run the evaluation, passing in the array of checks you want to use.

import {
    Okareo,
    RunTestProps,
    components,
    SeedData,
    TestRunType,
    OpenAIModel,
    GenerationReporter,
} from "okareo-ts-sdk";

const OKAREO_API_KEY = process.env.OKAREO_API_KEY;
const OPENAI_API_KEY = process.env.OPENAI_API_KEY;

const UNIQUE_BUILD_ID = (process.env.DEMO_BUILD_ID || `local.${(Math.random() + 1).toString(36).substring(7)}`);

const PROJECT_NAME = "Global";
const MODEL_NAME = "Text Summarizer";
const SCENARIO_SET_NAME = "Webbizz Articles for Text Summarization";

const USER_PROMPT_TEMPLATE = "{scenario_input}";
const SUMMARIZATION_CONTEXT_TEMPLATE = "You will be provided with text. Summarize the text in 1 simple sentence.";

const main = async () => {
    const okareo = new Okareo({ api_key: OKAREO_API_KEY });
    const project: any[] = await okareo.getProjects();
    const project_id = project.find(p => p.name === PROJECT_NAME)?.id;

    // 1. Create scenario set
    const SCENARIO_SET = [
        SeedData({
            input: "WebBizz is dedicated to providing our customers with a seamless online shopping experience. Our platform is designed with user-friendly interfaces to help you browse and select the best products suitable for your needs. We offer a wide range of products from top brands and new entrants, ensuring diversity and quality in our offerings. Our 24/7 customer support is ready to assist you with any queries, from product details, shipping timelines, to payment methods. We also have a dedicated FAQ section addressing common concerns. Always ensure you are logged in to enjoy personalized product recommendations and faster checkout processes.",
            result: "WebBizz offers a diverse, user-friendly online shopping experience with 24/7 customer support and personalized features for a seamless purchase journey."
        }),
        SeedData({
            input: "Safety and security of your data is our top priority at WebBizz. Our platform employs state-of-the-art encryption methods ensuring your personal and financial information remains confidential. Our two-factor authentication at checkout provides an added layer of security. We understand the importance of timely deliveries, hence we've partnered with reliable logistics partners ensuring your products reach you in pristine condition. In case of any delays or issues, our tracking tool can provide real-time updates on your product's location. We believe in transparency and guarantee no hidden fees or charges during your purchase journey.",
            result: "WebBizz prioritizes data security, using encryption and two-factor authentication, while ensuring timely deliveries with real-time tracking and no hidden fees."
        }),
        SeedData({
            input: "WebBizz places immense value on its dedicated clientele, recognizing their loyalty through the exclusive 'Premium Club' membership. This special program is designed to enrich the shopping experience, providing a suite of benefits tailored to our valued members. Among the advantages, members enjoy complimentary shipping, granting them a seamless and cost-effective way to receive their purchases. Additionally, the 'Premium Club' offers early access to sales, allowing members to avail themselves of promotional offers before they are opened to the general public.",
            result: "WebBizz rewards loyal customers with its 'Premium Club' membership, offering free shipping and early access to sales for an enhanced shopping experience."
        })
    ];

    const scenario: any = await okareo.create_scenario_set({
        name: `${SCENARIO_SET_NAME} Scenario Set - ${UNIQUE_BUILD_ID}`,
        project_id: project_id,
        seed_data: SCENARIO_SET
    });

    // 2. Register your LLM with Okareo
    const model = await okareo.register_model({
        name: MODEL_NAME,
        tags: [`Build:${UNIQUE_BUILD_ID}`],
        project_id: project_id,
        models: {
            type: "openai",
            model_id: "gpt-3.5-turbo",
            temperature: 0.5,
            system_prompt_template: SUMMARIZATION_CONTEXT_TEMPLATE,
            user_prompt_template: USER_PROMPT_TEMPLATE,
        } as OpenAIModel,
        update: true,
    });

    // 3. Run your LLM evaluation
    const eval_run: components["schemas"]["TestRunItem"] = await model.run_test({
        model_api_key: OPENAI_API_KEY,
        name: `${MODEL_NAME} Eval ${UNIQUE_BUILD_ID}`,
        tags: [`Build:${UNIQUE_BUILD_ID}`],
        project_id: project_id,
        scenario: scenario,
        calculate_metrics: true,
        type: TestRunType.NL_GENERATION,
        checks: [
            "coherence_summary",
            "consistency_summary",
            "fluency_summary",
            "relevance_summary"
        ]
    } as RunTestProps);

    // Print a direct link to the evaluation report in Okareo (for convenience)
    console.log(`See results in Okareo: ${eval_run.app_link}`);
};

main();

You can run your flow with the okareo run -f <YOUR_FLOW_SCRIPT_NAME> command, and then click on the printed link to view graphs of the results of your checks in Okareo’s web interface so you can visualize the distributions. This can help you better understand the behavior of your app, as discussed above.

Screenshot of results from the LLM-as-a-judge checks for coherence, consistency, fluency, and relevance.

If you want to automate things even further, you can also easily integrate your Okareo LLM evaluation flow scripts into your existing CI/CD system.

LLM evaluation metrics will help you improve your app

LLM evaluation metrics will enable you to explore and understand the behavior of your app and therefore help you improve it.

It’s important to evaluate your LLM app in a way that’s quantifiable, objective, and easily automated. Okareo provides a wide variety of LLM evaluation metrics for different use cases, as well as the ability to define your own. You can try Okareo for free here.

To verify that your LLMs or LLM-based apps are performing correctly, you need objective evaluation metrics. This article explores the kind of LLM evaluation metrics that exist and which ones are right for your use case. It explains how to use Okareo to evaluate your LLM according to these metrics and others of your choice.

Why you need LLM evaluation metrics

As with any other apps, you need to evaluate LLM-based apps to ensure they're achieving the purpose for which you implemented them, and to make sure the model’s output doesn't get worse when you change something (for example, replacing the model, retraining it, or changing your system prompt). In the case of LLMs in particular, you also need to be on guard against model drift, where the model’s output gets worse over time even without you changing anything.

You also need metrics that are objective and quantitative, so that you can:

  • evaluate your LLM app in a consistent and unbiased way.

  • measure changes over time.

  • correlate changes (such as improved or worsened output) with changes you make (for example, to your prompts).

  • compare one model with another.

Finally, the metrics need to be easily automatable so you can schedule regular, frequent checks and see how things change over time, and so that you can integrate the metrics into your existing testing infrastructure, including CI/CD. 

Types of LLM evaluation metrics

You can divide LLM evaluation metrics into reference-based and reference-free metrics. Reference-based metrics compare your LLM’s output against a “gold standard,” or example of expected output. For example, BLEU score (which is commonly used in translation tasks) measures the overlap of n-grams between the text generated by an LLM and some reference text. By contrast, reference-free metrics evaluate the output in isolation. For example, measuring consistency can tell you how well your model generates stable, reliable responses across similar inputs without needing an expected output to compare against.

Separately, you can also divide them into deterministic and non-deterministic metrics. 

Deterministic metrics, such as word count or character count, are straightforward and clear-cut: if you measure the same output multiple times, the result will always be the same. They also include structural validation metrics for code generation, which check that the output is in a specific format: for example, that it conforms to the JSON specification, or that the JSON contains certain required terms. 

Non-deterministic metrics, such as the coherence or friendliness of a piece of generated text, are usually more subjective. These previously required a human judge, but now they can be judged by another LLM.

A table showing the differences between reference-based and reference-free metrics, and deterministic and non-deterministic metrics.

There are also metrics that measure efficiency or performance: for example, latency or inference time, or usage of memory or other resources. These are external to the output, unlike the metrics above, which all measure properties of the output.

Which LLM evaluation metrics should you use?

The evaluation metrics you use will depend on your use case and what you want to measure.

Reference-based vs. reference-free metrics

Reference-free metrics are often more versatile than reference-based ones, because reference-based metrics need a “gold standard,” or ground truth, to compare against, and these need to come from somewhere. This need increases overheads and limits the volume of testing you can do. It also limits the kinds of things you can test if you’re only testing similarity (in whatever sense) to a gold standard document. However, for some use cases, this may be what you want. For example, if you’re building an app for translation or summarization, you will want to check that your output is sufficiently similar in meaning to the original.

With reference-free metrics, you can evaluate your output in isolation. There’s no gold standard document needed, so you don't need a human in the loop to create it or evaluate its suitability. Reference-free metrics also increase the range of things you can test: for example, friendliness, politeness, or fluency, which are properties of the output on its own, not in reference to another document. 

Reference-free metrics also allow for testing at much higher volumes, through strategies like creating an initial dataset of test cases and using it as a seed to generate large numbers of similar test cases.

Deterministic vs. non-deterministic metrics

Deterministic metrics are good for any property of the output that can be defined programmatically. If the output is plain text, this could be word or character counts. If you’re generating code, you might want to check that it’s valid JSON or syntactically correct Python. You could then check for the presence or quantity of any given properties in the generated JSON or function calls or arguments in the generated Python code.

Non-deterministic metrics are good for almost everything else. For example, you could evaluate the output for fluency (correct spelling, punctuation, and grammar) or for coherence (how good the structure and organization are). You could check the relevance of the output to the user’s query, or you could check the consistency between the input and the output, using entailment-based metrics to determine whether the output text entails, contradicts, or undermines the premise, and thus detect inconsistency.

It’s important to use the right tool for the job. Non-deterministic metrics are exciting because they’re more advanced and sophisticated than deterministic ones, but if what you care about is the length of the output or whether it contains certain keywords, you don’t need them or the extra overhead they come with in the form of the second LLM that’s used to judge the quality of your LLM.

Using LLM evaluation metrics for exploratory data analysis

Although LLM evaluation metrics are useful for CI/CD and for regression testing — to make sure the output of your model isn’t getting worse due to changes you’ve made or over time due to model drift — that's not their only purpose.

You can also use them as an exploratory tool to get a sense of the distribution of your app’s output, particularly during the early stages of building your app. This allows you to understand the behavior of your app better. In particular, it gives you insight into whether small changes in your system prompt or other settings are likely to lead to small or large changes in the output. This means you can gauge its robustness (if you’re broadly happy with its behavior and don’t want it to change much) — or, if you do want it to change, you can gauge how achievable that is.

Continuous metrics are the most effective type for exploratory data analysis, as opposed to binary pass/fail metrics that are not granular enough to convey the distribution of things like text length, verbosity, or conciseness. 

Consider an LLM app developer who is using both a “word count” metric and a “below 256 characters?” metric in their test suite. That could seem redundant at first glance, but “below 256 characters?” is a binary pass/fail metric and doesn’t give you a sense of the distribution of the length of the output. If it meets a given character count threshold half the time, that doesn’t tell you whether the results are all tightly clustered either side of that threshold, normally distributed around the threshold, heavily skewed, or randomly and uniformly distributed all over the place.

Some possible distributions of the length of the output from an LLM app.

It’s useful to know which of these your output resembles. If it’s normal or uniform, a small change to the model or the system prompt probably won’t make much difference to your binary pass/fail metric; but if the distribution is tightly clustered or skewed, a small change could make the difference between mostly passing and mostly failing.

Also, since LLMs work on tokens, they operate at a word or part-word level — they can’t directly see or control the number of characters in their output. For example, you may be interested in measuring both the word count and the character count; if you sometimes use a system prompt that encourages formal writing and longer words and other times use one that encourages short, simple words, then the word count may be poorly correlated with the character count.

Creating LLM evaluation metrics using Okareo

Okareo tests all the metrics discussed above out of the box, and also lets you define your own custom checks. You can create any type of metrics with Okareo, including ones that use LLM as a judge, such as friendliness, and ones you define yourself.

The simplest way to experiment is in the Okareo web app. Go to Checks and browse the list of built-in options, which includes examples of different kinds of checks: deterministic ones like is_json or does_code_compile,and non-deterministic ones like coherence or fluency.

You can also create your own checks in Okareo. A check can be either a CodeBasedCheck deterministically implemented in code, or a ModelBasedCheck, where the check is described using a prompt and evaluated by an LLM judge. 

In the web app, you can do this using the Create Check button. You can type a description of the check you want, and Okareo will use an LLM to immediately generate the check in Python code, which can then be used to evaluate the check automatically at evaluation time.

Below is an example of a user-created check that verifies whether the output is valid JSON with the properties short_summary, actions, and attendee_list. Okareo has generated the Python code for deterministically checking this. Once you’ve created a check, you can click on it and a modal will appear showing similar code.

Screenshot of the Python code generated by Okareo for the JSON check.

Once all your checks have been defined in Okareo (either in-app or programmatically), you can run an evaluation with your custom checks using flows (scripts written in Python or TypeScript) and config files. 

Running an LLM evaluation using an Okareo flow

Below is an example of an Okareo flow for running an LLM evaluation that you can follow along with. You'll need to install the Okareo CLI locally to run your flows.

This particular example is in TypeScript and the full code for this example is available on our GitHub. The example applies non-deterministic checks for coherence, consistency, fluency, and relevance to the output of a meeting summarizer app using an LLM judge. 

The code below shows that there are three main steps to running an LLM evaluation in Okareo:

  1. Create a scenario set: This is an array of inputs, each paired with a result. For reference-based metrics, the result is the gold-standard reference. For reference-free metrics, you can just pass in an empty string.

  2. Register your model with Okareo: This example uses OpenAI's GPT-3.5 Turbo model.

  3. Run the evaluation: Use the run_test method to run the evaluation, passing in the array of checks you want to use.

import {
Okareo,
RunTestProps,
components,
SeedData,
      TestRunType,
      OpenAIModel,
      GenerationReporter,
} from "okareo-ts-sdk";

const OKAREO_API_KEY = process.env.OKAREO_API_KEY;
const OPENAI_API_KEY = process.env.OPENAI_API_KEY;

const UNIQUE_BUILD_ID = (process.env.DEMO_BUILD_ID || `local.${(Math.random() + 1).toString(36).substring(7)}`);

const PROJECT_NAME = "Global";
const MODEL_NAME = "Text Summarizer";
const SCENARIO_SET_NAME = "Webbizz Articles for Text Summarization";

const USER_PROMPT_TEMPLATE = "{scenario_input}"
const SUMMARIZATION_CONTEXT_TEMPLATE = "You will be provided with text. Summarize the text in 1 simple sentence."

const main = async () => {

const okareo = new Okareo({api_key: OKAREO_API_KEY });
const project: any[] = await okareo.getProjects();
const project_id = project.find(p => p.name === PROJECT_NAME)?.id;

    // 1. Create scenario set
    const SCENARIO_SET = [
    SeedData({
        input:"WebBizz is dedicated to providing our customers with a seamless online shopping experience. Our platform is designed with user-friendly interfaces to help you browse and select the best products suitable for your needs. We offer a wide range of products from top brands and new entrants, ensuring diversity and quality in our offerings. Our 24/7 customer support is ready to assist you with any queries, from product details, shipping timelines, to payment methods. We also have a dedicated FAQ section addressing common concerns. Always ensure you are logged in to enjoy personalized product recommendations and faster checkout processes.",
            result:"WebBizz offers a diverse, user-friendly online shopping experience with 24/7 customer support and personalized features for a seamless purchase journey." 
    }),
    SeedData({
        input:"Safety and security of your data is our top priority at WebBizz. Our platform employs state-of-the-art encryption methods ensuring your personal and financial information remains confidential. Our two-factor authentication at checkout provides an added layer of security. We understand the importance of timely deliveries, hence we've partnered with reliable logistics partners ensuring your products reach you in pristine condition. In case of any delays or issues, our tracking tool can provide real-time updates on your product's location. We believe in transparency and guarantee no hidden fees or charges during your purchase journey.", 
        result:"WebBizz prioritizes data security, using encryption and two-factor authentication, while ensuring timely deliveries with real-time tracking and no hidden fees."
    }),
    SeedData({
        input:"WebBizz places immense value on its dedicated clientele, recognizing their loyalty through the exclusive 'Premium Club' membership. This special program is designed to enrich the shopping experience, providing a suite of benefits tailored to our valued members. Among the advantages, members enjoy complimentary shipping, granting them a seamless and cost-effective way to receive their purchases. Additionally, the 'Premium Club' offers early access to sales, allowing members to avail themselves of promotional offers before they are opened to the general public.",
        result:"WebBizz rewards loyal customers with its 'Premium Club' membership, offering free shipping and early access to sales for an enhanced shopping experience."
    })
];

    const scenario: any = await okareo.create_scenario_set(
        {
        name: `${SCENARIO_SET_NAME} Scenario Set - ${UNIQUE_BUILD_ID}`,
        project_id: project_id,
            seed_data: SCENARIO_SET
        }
    );


    // 2. Register your LLM with Okareo
    const model = await okareo.register_model({
name: MODEL_NAME,
tags: [`Build:${UNIQUE_BUILD_ID}`],
project_id: project_id,
models: {
type: "openai",
model_id:"gpt-3.5-turbo",
temperature:0.5,
system_prompt_template:SUMMARIZATION_CONTEXT_TEMPLATE,
user_prompt_template:USER_PROMPT_TEMPLATE,
} as OpenAIModel,
update: true,
});

    // 3. Run your LLM evaluation
const eval_run: components["schemas"]["TestRunItem"] = await model.run_test({
model_api_key: OPENAI_API_KEY,
name: `${MODEL_NAME} Eval ${UNIQUE_BUILD_ID}`,
tags: [`Build:${UNIQUE_BUILD_ID}`],
project_id: project_id,
scenario: scenario,
calculate_metrics: true,
type: TestRunType.NL_GENERATION,
checks: [
"coherence_summary",
"consistency_summary",
"fluency_summary",
"relevance_summary"
]
} as RunTestProps);

// Print a direct link to the evaluation report in Okareo (for convenience)
console.log(`See results in Okareo: ${eval_run.app_link}`);

}
main();

You can run your flow with the okareo run -f <YOUR_FLOW_SCRIPT_NAME> command, and then click on the printed link to view graphs of the results of your checks in Okareo’s web interface so you can visualize the distributions. This can help you better understand the behavior of your app, as discussed above.

Screenshot of results from the LLM-as-a-judge checks for coherence, consistency, fluency, and relevance.

If you want to automate things even further, you can also easily integrate your Okareo LLM evaluation flow scripts into your existing CI/CD system.

LLM evaluation metrics will help you improve your app

LLM evaluation metrics will enable you to explore and understand the behavior of your app and therefore help you improve it.

It’s important to evaluate your LLM app in a way that’s quantifiable, objective, and easily automated. Okareo provides a wide variety of LLM evaluation metrics for different use cases, as well as the ability to define your own. You can try Okareo for free here.

To verify that your LLMs or LLM-based apps are performing correctly, you need objective evaluation metrics. This article explores the kind of LLM evaluation metrics that exist and which ones are right for your use case. It explains how to use Okareo to evaluate your LLM according to these metrics and others of your choice.

Why you need LLM evaluation metrics

As with any other apps, you need to evaluate LLM-based apps to ensure they're achieving the purpose for which you implemented them, and to make sure the model’s output doesn't get worse when you change something (for example, replacing the model, retraining it, or changing your system prompt). In the case of LLMs in particular, you also need to be on guard against model drift, where the model’s output gets worse over time even without you changing anything.

You also need metrics that are objective and quantitative, so that you can:

  • evaluate your LLM app in a consistent and unbiased way.

  • measure changes over time.

  • correlate changes (such as improved or worsened output) with changes you make (for example, to your prompts).

  • compare one model with another.

Finally, the metrics need to be easily automatable so you can schedule regular, frequent checks and see how things change over time, and so that you can integrate the metrics into your existing testing infrastructure, including CI/CD. 

Types of LLM evaluation metrics

You can divide LLM evaluation metrics into reference-based and reference-free metrics. Reference-based metrics compare your LLM’s output against a “gold standard,” or example of expected output. For example, BLEU score (which is commonly used in translation tasks) measures the overlap of n-grams between the text generated by an LLM and some reference text. By contrast, reference-free metrics evaluate the output in isolation. For example, measuring consistency can tell you how well your model generates stable, reliable responses across similar inputs without needing an expected output to compare against.

Separately, you can also divide them into deterministic and non-deterministic metrics. 

Deterministic metrics, such as word count or character count, are straightforward and clear-cut: if you measure the same output multiple times, the result will always be the same. They also include structural validation metrics for code generation, which check that the output is in a specific format: for example, that it conforms to the JSON specification, or that the JSON contains certain required terms. 

Non-deterministic metrics, such as the coherence or friendliness of a piece of generated text, are usually more subjective. These previously required a human judge, but now they can be judged by another LLM.

A table showing the differences between reference-based and reference-free metrics, and deterministic and non-deterministic metrics.

There are also metrics that measure efficiency or performance: for example, latency or inference time, or usage of memory or other resources. These are external to the output, unlike the metrics above, which all measure properties of the output.

Which LLM evaluation metrics should you use?

The evaluation metrics you use will depend on your use case and what you want to measure.

Reference-based vs. reference-free metrics

Reference-free metrics are often more versatile than reference-based ones, because reference-based metrics need a “gold standard,” or ground truth, to compare against, and these need to come from somewhere. This need increases overheads and limits the volume of testing you can do. It also limits the kinds of things you can test if you’re only testing similarity (in whatever sense) to a gold standard document. However, for some use cases, this may be what you want. For example, if you’re building an app for translation or summarization, you will want to check that your output is sufficiently similar in meaning to the original.

With reference-free metrics, you can evaluate your output in isolation. There’s no gold standard document needed, so you don't need a human in the loop to create it or evaluate its suitability. Reference-free metrics also increase the range of things you can test: for example, friendliness, politeness, or fluency, which are properties of the output on its own, not in reference to another document. 

Reference-free metrics also allow for testing at much higher volumes, through strategies like creating an initial dataset of test cases and using it as a seed to generate large numbers of similar test cases.

Deterministic vs. non-deterministic metrics

Deterministic metrics are good for any property of the output that can be defined programmatically. If the output is plain text, this could be word or character counts. If you’re generating code, you might want to check that it’s valid JSON or syntactically correct Python. You could then check for the presence or quantity of any given properties in the generated JSON or function calls or arguments in the generated Python code.

Non-deterministic metrics are good for almost everything else. For example, you could evaluate the output for fluency (correct spelling, punctuation, and grammar) or for coherence (how good the structure and organization are). You could check the relevance of the output to the user’s query, or you could check the consistency between the input and the output, using entailment-based metrics to determine whether the output text entails, contradicts, or undermines the premise, and thus detect inconsistency.

It’s important to use the right tool for the job. Non-deterministic metrics are exciting because they’re more advanced and sophisticated than deterministic ones, but if what you care about is the length of the output or whether it contains certain keywords, you don’t need them or the extra overhead they come with in the form of the second LLM that’s used to judge the quality of your LLM.

Using LLM evaluation metrics for exploratory data analysis

Although LLM evaluation metrics are useful for CI/CD and for regression testing — to make sure the output of your model isn’t getting worse due to changes you’ve made or over time due to model drift — that's not their only purpose.

You can also use them as an exploratory tool to get a sense of the distribution of your app’s output, particularly during the early stages of building your app. This allows you to understand the behavior of your app better. In particular, it gives you insight into whether small changes in your system prompt or other settings are likely to lead to small or large changes in the output. This means you can gauge its robustness (if you’re broadly happy with its behavior and don’t want it to change much) — or, if you do want it to change, you can gauge how achievable that is.

Continuous metrics are the most effective type for exploratory data analysis, as opposed to binary pass/fail metrics that are not granular enough to convey the distribution of things like text length, verbosity, or conciseness. 

Consider an LLM app developer who is using both a “word count” metric and a “below 256 characters?” metric in their test suite. That could seem redundant at first glance, but “below 256 characters?” is a binary pass/fail metric and doesn’t give you a sense of the distribution of the length of the output. If it meets a given character count threshold half the time, that doesn’t tell you whether the results are all tightly clustered either side of that threshold, normally distributed around the threshold, heavily skewed, or randomly and uniformly distributed all over the place.

Some possible distributions of the length of the output from an LLM app.

It’s useful to know which of these your output resembles. If it’s normal or uniform, a small change to the model or the system prompt probably won’t make much difference to your binary pass/fail metric; but if the distribution is tightly clustered or skewed, a small change could make the difference between mostly passing and mostly failing.

Also, since LLMs work on tokens, they operate at a word or part-word level — they can’t directly see or control the number of characters in their output. For example, you may be interested in measuring both the word count and the character count; if you sometimes use a system prompt that encourages formal writing and longer words and other times use one that encourages short, simple words, then the word count may be poorly correlated with the character count.

Creating LLM evaluation metrics using Okareo

Okareo tests all the metrics discussed above out of the box, and also lets you define your own custom checks. You can create any type of metrics with Okareo, including ones that use LLM as a judge, such as friendliness, and ones you define yourself.

The simplest way to experiment is in the Okareo web app. Go to Checks and browse the list of built-in options, which includes examples of different kinds of checks: deterministic ones like is_json or does_code_compile,and non-deterministic ones like coherence or fluency.

You can also create your own checks in Okareo. A check can be either a CodeBasedCheck deterministically implemented in code, or a ModelBasedCheck, where the check is described using a prompt and evaluated by an LLM judge. 

In the web app, you can do this using the Create Check button. You can type a description of the check you want, and Okareo will use an LLM to immediately generate the check in Python code, which can then be used to evaluate the check automatically at evaluation time.

Below is an example of a user-created check that verifies whether the output is valid JSON with the properties short_summary, actions, and attendee_list. Okareo has generated the Python code for deterministically checking this. Once you’ve created a check, you can click on it and a modal will appear showing similar code.

Screenshot of the Python code generated by Okareo for the JSON check.

Once all your checks have been defined in Okareo (either in-app or programmatically), you can run an evaluation with your custom checks using flows (scripts written in Python or TypeScript) and config files. 

Running an LLM evaluation using an Okareo flow

Below is an example of an Okareo flow for running an LLM evaluation that you can follow along with. You'll need to install the Okareo CLI locally to run your flows.

This particular example is in TypeScript and the full code for this example is available on our GitHub. The example applies non-deterministic checks for coherence, consistency, fluency, and relevance to the output of a meeting summarizer app using an LLM judge. 

The code below shows that there are three main steps to running an LLM evaluation in Okareo:

  1. Create a scenario set: This is an array of inputs, each paired with a result. For reference-based metrics, the result is the gold-standard reference. For reference-free metrics, you can just pass in an empty string.

  2. Register your model with Okareo: This example uses OpenAI's GPT-3.5 Turbo model.

  3. Run the evaluation: Use the run_test method to run the evaluation, passing in the array of checks you want to use.

import {
Okareo,
RunTestProps,
components,
SeedData,
      TestRunType,
      OpenAIModel,
      GenerationReporter,
} from "okareo-ts-sdk";

const OKAREO_API_KEY = process.env.OKAREO_API_KEY;
const OPENAI_API_KEY = process.env.OPENAI_API_KEY;

const UNIQUE_BUILD_ID = (process.env.DEMO_BUILD_ID || `local.${(Math.random() + 1).toString(36).substring(7)}`);

const PROJECT_NAME = "Global";
const MODEL_NAME = "Text Summarizer";
const SCENARIO_SET_NAME = "Webbizz Articles for Text Summarization";

const USER_PROMPT_TEMPLATE = "{scenario_input}"
const SUMMARIZATION_CONTEXT_TEMPLATE = "You will be provided with text. Summarize the text in 1 simple sentence."

const main = async () => {

const okareo = new Okareo({api_key: OKAREO_API_KEY });
const project: any[] = await okareo.getProjects();
const project_id = project.find(p => p.name === PROJECT_NAME)?.id;

    // 1. Create scenario set
    const SCENARIO_SET = [
    SeedData({
        input:"WebBizz is dedicated to providing our customers with a seamless online shopping experience. Our platform is designed with user-friendly interfaces to help you browse and select the best products suitable for your needs. We offer a wide range of products from top brands and new entrants, ensuring diversity and quality in our offerings. Our 24/7 customer support is ready to assist you with any queries, from product details, shipping timelines, to payment methods. We also have a dedicated FAQ section addressing common concerns. Always ensure you are logged in to enjoy personalized product recommendations and faster checkout processes.",
            result:"WebBizz offers a diverse, user-friendly online shopping experience with 24/7 customer support and personalized features for a seamless purchase journey." 
    }),
    SeedData({
        input:"Safety and security of your data is our top priority at WebBizz. Our platform employs state-of-the-art encryption methods ensuring your personal and financial information remains confidential. Our two-factor authentication at checkout provides an added layer of security. We understand the importance of timely deliveries, hence we've partnered with reliable logistics partners ensuring your products reach you in pristine condition. In case of any delays or issues, our tracking tool can provide real-time updates on your product's location. We believe in transparency and guarantee no hidden fees or charges during your purchase journey.", 
        result:"WebBizz prioritizes data security, using encryption and two-factor authentication, while ensuring timely deliveries with real-time tracking and no hidden fees."
    }),
    SeedData({
        input:"WebBizz places immense value on its dedicated clientele, recognizing their loyalty through the exclusive 'Premium Club' membership. This special program is designed to enrich the shopping experience, providing a suite of benefits tailored to our valued members. Among the advantages, members enjoy complimentary shipping, granting them a seamless and cost-effective way to receive their purchases. Additionally, the 'Premium Club' offers early access to sales, allowing members to avail themselves of promotional offers before they are opened to the general public.",
        result:"WebBizz rewards loyal customers with its 'Premium Club' membership, offering free shipping and early access to sales for an enhanced shopping experience."
    })
];

    const scenario: any = await okareo.create_scenario_set(
        {
        name: `${SCENARIO_SET_NAME} Scenario Set - ${UNIQUE_BUILD_ID}`,
        project_id: project_id,
            seed_data: SCENARIO_SET
        }
    );


    // 2. Register your LLM with Okareo
    const model = await okareo.register_model({
name: MODEL_NAME,
tags: [`Build:${UNIQUE_BUILD_ID}`],
project_id: project_id,
models: {
type: "openai",
model_id:"gpt-3.5-turbo",
temperature:0.5,
system_prompt_template:SUMMARIZATION_CONTEXT_TEMPLATE,
user_prompt_template:USER_PROMPT_TEMPLATE,
} as OpenAIModel,
update: true,
});

    // 3. Run your LLM evaluation
const eval_run: components["schemas"]["TestRunItem"] = await model.run_test({
model_api_key: OPENAI_API_KEY,
name: `${MODEL_NAME} Eval ${UNIQUE_BUILD_ID}`,
tags: [`Build:${UNIQUE_BUILD_ID}`],
project_id: project_id,
scenario: scenario,
calculate_metrics: true,
type: TestRunType.NL_GENERATION,
checks: [
"coherence_summary",
"consistency_summary",
"fluency_summary",
"relevance_summary"
]
} as RunTestProps);

// Print a direct link to the evaluation report in Okareo (for convenience)
console.log(`See results in Okareo: ${eval_run.app_link}`);

}
main();

You can run your flow with the okareo run -f <YOUR_FLOW_SCRIPT_NAME> command, and then click on the printed link to view graphs of the results of your checks in Okareo’s web interface so you can visualize the distributions. This can help you better understand the behavior of your app, as discussed above.

Screenshot of results from the LLM-as-a-judge checks for coherence, consistency, fluency, and relevance.

If you want to automate things even further, you can also easily integrate your Okareo LLM evaluation flow scripts into your existing CI/CD system.

LLM evaluation metrics will help you improve your app

LLM evaluation metrics will enable you to explore and understand the behavior of your app and therefore help you improve it.

It’s important to evaluate your LLM app in a way that’s quantifiable, objective, and easily automated. Okareo provides a wide variety of LLM evaluation metrics for different use cases, as well as the ability to define your own. You can try Okareo for free here.

To verify that your LLMs or LLM-based apps are performing correctly, you need objective evaluation metrics. This article explores the kind of LLM evaluation metrics that exist and which ones are right for your use case. It explains how to use Okareo to evaluate your LLM according to these metrics and others of your choice.

Why you need LLM evaluation metrics

As with any other apps, you need to evaluate LLM-based apps to ensure they're achieving the purpose for which you implemented them, and to make sure the model’s output doesn't get worse when you change something (for example, replacing the model, retraining it, or changing your system prompt). In the case of LLMs in particular, you also need to be on guard against model drift, where the model’s output gets worse over time even without you changing anything.

You also need metrics that are objective and quantitative, so that you can:

  • evaluate your LLM app in a consistent and unbiased way.

  • measure changes over time.

  • correlate changes (such as improved or worsened output) with changes you make (for example, to your prompts).

  • compare one model with another.

Finally, the metrics need to be easily automatable so you can schedule regular, frequent checks and see how things change over time, and so that you can integrate the metrics into your existing testing infrastructure, including CI/CD. 

Types of LLM evaluation metrics

You can divide LLM evaluation metrics into reference-based and reference-free metrics. Reference-based metrics compare your LLM’s output against a “gold standard,” or example of expected output. For example, BLEU score (which is commonly used in translation tasks) measures the overlap of n-grams between the text generated by an LLM and some reference text. By contrast, reference-free metrics evaluate the output in isolation. For example, measuring consistency can tell you how well your model generates stable, reliable responses across similar inputs without needing an expected output to compare against.

Separately, you can also divide them into deterministic and non-deterministic metrics. 

Deterministic metrics, such as word count or character count, are straightforward and clear-cut: if you measure the same output multiple times, the result will always be the same. They also include structural validation metrics for code generation, which check that the output is in a specific format: for example, that it conforms to the JSON specification, or that the JSON contains certain required terms. 

Non-deterministic metrics, such as the coherence or friendliness of a piece of generated text, are usually more subjective. These previously required a human judge, but now they can be judged by another LLM.

A table showing the differences between reference-based and reference-free metrics, and deterministic and non-deterministic metrics.

There are also metrics that measure efficiency or performance: for example, latency or inference time, or usage of memory or other resources. These are external to the output, unlike the metrics above, which all measure properties of the output.

Which LLM evaluation metrics should you use?

The evaluation metrics you use will depend on your use case and what you want to measure.

Reference-based vs. reference-free metrics

Reference-free metrics are often more versatile than reference-based ones, because reference-based metrics need a “gold standard,” or ground truth, to compare against, and these need to come from somewhere. This need increases overheads and limits the volume of testing you can do. It also limits the kinds of things you can test if you’re only testing similarity (in whatever sense) to a gold standard document. However, for some use cases, this may be what you want. For example, if you’re building an app for translation or summarization, you will want to check that your output is sufficiently similar in meaning to the original.

With reference-free metrics, you can evaluate your output in isolation. There’s no gold standard document needed, so you don't need a human in the loop to create it or evaluate its suitability. Reference-free metrics also increase the range of things you can test: for example, friendliness, politeness, or fluency, which are properties of the output on its own, not in reference to another document. 

Reference-free metrics also allow for testing at much higher volumes, through strategies like creating an initial dataset of test cases and using it as a seed to generate large numbers of similar test cases.

Deterministic vs. non-deterministic metrics

Deterministic metrics are good for any property of the output that can be defined programmatically. If the output is plain text, this could be word or character counts. If you’re generating code, you might want to check that it’s valid JSON or syntactically correct Python. You could then check for the presence or quantity of any given properties in the generated JSON or function calls or arguments in the generated Python code.

Non-deterministic metrics are good for almost everything else. For example, you could evaluate the output for fluency (correct spelling, punctuation, and grammar) or for coherence (how good the structure and organization are). You could check the relevance of the output to the user’s query, or you could check the consistency between the input and the output, using entailment-based metrics to determine whether the output text entails, contradicts, or undermines the premise, and thus detect inconsistency.

It’s important to use the right tool for the job. Non-deterministic metrics are exciting because they’re more advanced and sophisticated than deterministic ones, but if what you care about is the length of the output or whether it contains certain keywords, you don’t need them or the extra overhead they come with in the form of the second LLM that’s used to judge the quality of your LLM.

Using LLM evaluation metrics for exploratory data analysis

Although LLM evaluation metrics are useful for CI/CD and for regression testing — to make sure the output of your model isn’t getting worse due to changes you’ve made or over time due to model drift — that's not their only purpose.

You can also use them as an exploratory tool to get a sense of the distribution of your app’s output, particularly during the early stages of building your app. This allows you to understand the behavior of your app better. In particular, it gives you insight into whether small changes in your system prompt or other settings are likely to lead to small or large changes in the output. This means you can gauge its robustness (if you’re broadly happy with its behavior and don’t want it to change much) — or, if you do want it to change, you can gauge how achievable that is.

Continuous metrics are the most effective for exploratory data analysis; binary pass/fail metrics aren’t granular enough to convey the distribution of properties like text length, verbosity, or conciseness.

Consider an LLM app developer who is using both a “word count” metric and a “below 256 characters?” metric in their test suite. That could seem redundant at first glance, but “below 256 characters?” is a binary pass/fail metric and doesn’t give you a sense of the distribution of the length of the output. If the output falls below that threshold half the time, that alone doesn’t tell you whether the lengths are tightly clustered on either side of the threshold, normally distributed around it, heavily skewed, or spread uniformly all over the place.

Some possible distributions of the length of the output from an LLM app.

It’s useful to know which of these your output resembles. If it’s normal or uniform, a small change to the model or the system prompt probably won’t make much difference to your binary pass/fail metric; but if the distribution is tightly clustered or skewed, a small change could make the difference between mostly passing and mostly failing.

Also, because LLMs work on tokens (whole or partial words), they can’t directly see or control the number of characters in their output, so it can be worth measuring both the word count and the character count. For example, if one of your system prompts encourages formal writing with longer words and another encourages short, simple words, the word count may be poorly correlated with the character count.
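To make this concrete, here is a small TypeScript sketch (with illustrative names) that computes both the binary pass rate for a 256-character threshold and a crude histogram of character counts, given a list of outputs you have collected.

// Sketch: contrast a binary pass/fail metric with the underlying distribution.
// `outputs` is assumed to be a list of responses collected from your app.
function lengthStats(outputs: string[]) {
    const charCounts = outputs.map(o => o.length);
    const wordCounts = outputs.map(o => o.trim().split(/\s+/).filter(Boolean).length);

    // Binary metric: fraction of outputs under 256 characters.
    const passRate = charCounts.filter(c => c < 256).length / outputs.length;

    // Continuous view: histogram of character counts in buckets of 64,
    // which reveals whether lengths are clustered, skewed, or spread out.
    const histogram = new Map<number, number>();
    for (const c of charCounts) {
        const bucket = Math.floor(c / 64) * 64;
        histogram.set(bucket, (histogram.get(bucket) ?? 0) + 1);
    }

    return { passRate, histogram, wordCounts };
}

Plotting the histogram (or simply eyeballing the buckets) tells you which of the distributions above your output resembles, which the pass rate alone cannot.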

Creating LLM evaluation metrics using Okareo

Okareo supports all the metrics discussed above out of the box and also lets you define your own custom checks. You can create any type of metric with Okareo, including ones that use an LLM as a judge, such as friendliness.

The simplest way to experiment is in the Okareo web app. Go to Checks and browse the list of built-in options, which includes examples of different kinds of checks: deterministic ones like is_json or does_code_compile, and non-deterministic ones like coherence or fluency.

You can also create your own checks in Okareo. A check can be either a CodeBasedCheck, implemented deterministically in code, or a ModelBasedCheck, where the check is described using a prompt and evaluated by an LLM judge.

In the web app, you can do this using the Create Check button. Type a description of the check you want, and Okareo will use an LLM to immediately generate the check as Python code, which then runs automatically at evaluation time.

Below is an example of a user-created check that verifies whether the output is valid JSON with the properties short_summary, actions, and attendee_list. Okareo has generated the Python code for deterministically checking this. Once you’ve created a check, you can click on it and a modal will appear showing similar code.

Screenshot of the Python code generated by Okareo for the JSON check.
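The code Okareo generates for this check is Python, but the underlying logic is simple. As a rough TypeScript sketch of the equivalent check (illustrative only, not the code Okareo actually produces):

// Illustrative sketch of the JSON check's logic: the output must parse as JSON
// and contain all three required properties.
function hasRequiredProperties(output: string): boolean {
    let parsed: unknown;
    try {
        parsed = JSON.parse(output);
    } catch {
        return false; // Not valid JSON at all.
    }
    if (typeof parsed !== "object" || parsed === null) {
        return false;
    }
    const required = ["short_summary", "actions", "attendee_list"];
    return required.every(key => key in (parsed as Record<string, unknown>));
}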

Once all your checks have been defined in Okareo (either in-app or programmatically), you can run an evaluation with your custom checks using flows (scripts written in Python or TypeScript) and config files. 

Running an LLM evaluation using an Okareo flow

Below is an example of an Okareo flow for running an LLM evaluation that you can follow along with. You'll need to install the Okareo CLI locally to run your flows.

This particular example is in TypeScript; the full code is available on our GitHub. It uses an LLM judge to apply non-deterministic checks for coherence, consistency, fluency, and relevance to the output of a text summarization app.

The code below shows that there are three main steps to running an LLM evaluation in Okareo:

  1. Create a scenario set: This is an array of inputs, each paired with a result. For reference-based metrics, the result is the gold-standard reference. For reference-free metrics, you can just pass in an empty string.

  2. Register your model with Okareo: This example uses OpenAI's GPT-3.5 Turbo model.

  3. Run the evaluation: Use the run_test method to run the evaluation, passing in the array of checks you want to use.

import {
    Okareo,
    RunTestProps,
    components,
    SeedData,
    TestRunType,
    OpenAIModel,
    GenerationReporter,
} from "okareo-ts-sdk";

const OKAREO_API_KEY = process.env.OKAREO_API_KEY;
const OPENAI_API_KEY = process.env.OPENAI_API_KEY;

const UNIQUE_BUILD_ID = process.env.DEMO_BUILD_ID || `local.${(Math.random() + 1).toString(36).substring(7)}`;

const PROJECT_NAME = "Global";
const MODEL_NAME = "Text Summarizer";
const SCENARIO_SET_NAME = "Webbizz Articles for Text Summarization";

const USER_PROMPT_TEMPLATE = "{scenario_input}";
const SUMMARIZATION_CONTEXT_TEMPLATE = "You will be provided with text. Summarize the text in 1 simple sentence.";

const main = async () => {

    const okareo = new Okareo({ api_key: OKAREO_API_KEY });
    const project: any[] = await okareo.getProjects();
    const project_id = project.find(p => p.name === PROJECT_NAME)?.id;

    // 1. Create scenario set
    const SCENARIO_SET = [
        SeedData({
            input: "WebBizz is dedicated to providing our customers with a seamless online shopping experience. Our platform is designed with user-friendly interfaces to help you browse and select the best products suitable for your needs. We offer a wide range of products from top brands and new entrants, ensuring diversity and quality in our offerings. Our 24/7 customer support is ready to assist you with any queries, from product details, shipping timelines, to payment methods. We also have a dedicated FAQ section addressing common concerns. Always ensure you are logged in to enjoy personalized product recommendations and faster checkout processes.",
            result: "WebBizz offers a diverse, user-friendly online shopping experience with 24/7 customer support and personalized features for a seamless purchase journey."
        }),
        SeedData({
            input: "Safety and security of your data is our top priority at WebBizz. Our platform employs state-of-the-art encryption methods ensuring your personal and financial information remains confidential. Our two-factor authentication at checkout provides an added layer of security. We understand the importance of timely deliveries, hence we've partnered with reliable logistics partners ensuring your products reach you in pristine condition. In case of any delays or issues, our tracking tool can provide real-time updates on your product's location. We believe in transparency and guarantee no hidden fees or charges during your purchase journey.",
            result: "WebBizz prioritizes data security, using encryption and two-factor authentication, while ensuring timely deliveries with real-time tracking and no hidden fees."
        }),
        SeedData({
            input: "WebBizz places immense value on its dedicated clientele, recognizing their loyalty through the exclusive 'Premium Club' membership. This special program is designed to enrich the shopping experience, providing a suite of benefits tailored to our valued members. Among the advantages, members enjoy complimentary shipping, granting them a seamless and cost-effective way to receive their purchases. Additionally, the 'Premium Club' offers early access to sales, allowing members to avail themselves of promotional offers before they are opened to the general public.",
            result: "WebBizz rewards loyal customers with its 'Premium Club' membership, offering free shipping and early access to sales for an enhanced shopping experience."
        })
    ];

    const scenario: any = await okareo.create_scenario_set({
        name: `${SCENARIO_SET_NAME} Scenario Set - ${UNIQUE_BUILD_ID}`,
        project_id: project_id,
        seed_data: SCENARIO_SET
    });

    // 2. Register your LLM with Okareo
    const model = await okareo.register_model({
        name: MODEL_NAME,
        tags: [`Build:${UNIQUE_BUILD_ID}`],
        project_id: project_id,
        models: {
            type: "openai",
            model_id: "gpt-3.5-turbo",
            temperature: 0.5,
            system_prompt_template: SUMMARIZATION_CONTEXT_TEMPLATE,
            user_prompt_template: USER_PROMPT_TEMPLATE,
        } as OpenAIModel,
        update: true,
    });

    // 3. Run your LLM evaluation
    const eval_run: components["schemas"]["TestRunItem"] = await model.run_test({
        model_api_key: OPENAI_API_KEY,
        name: `${MODEL_NAME} Eval ${UNIQUE_BUILD_ID}`,
        tags: [`Build:${UNIQUE_BUILD_ID}`],
        project_id: project_id,
        scenario: scenario,
        calculate_metrics: true,
        type: TestRunType.NL_GENERATION,
        checks: [
            "coherence_summary",
            "consistency_summary",
            "fluency_summary",
            "relevance_summary"
        ]
    } as RunTestProps);

    // Print a direct link to the evaluation report in Okareo (for convenience)
    console.log(`See results in Okareo: ${eval_run.app_link}`);

};

main();

You can run your flow with the okareo run -f <YOUR_FLOW_SCRIPT_NAME> command and then click the printed link to view graphs of your check results in Okareo’s web interface, where you can visualize their distributions. This can help you better understand the behavior of your app, as discussed above.

Screenshot of results from the LLM-as-a-judge checks for coherence, consistency, fluency, and relevance.

If you want to automate things even further, you can also easily integrate your Okareo LLM evaluation flow scripts into your existing CI/CD system.

LLM evaluation metrics will help you improve your app

LLM evaluation metrics will enable you to explore and understand the behavior of your app and therefore help you improve it.

It’s important to evaluate your LLM app in a way that’s quantifiable, objective, and easily automated. Okareo provides a wide variety of LLM evaluation metrics for different use cases, as well as the ability to define your own. You can try Okareo for free here.
