What LLM Benchmarking Is, and Why You May Need Baselining Instead

Evaluation

Matt Wyman, Co-founder, Okareo

Oleksii Kolchay, Technical Content Writer

September 13, 2024

Large language model (LLM) benchmarks are a common way to understand LLM performance at a glance and compare the performance of various LLMs. In this article, we show what an LLM benchmark consists of and why you’re better off using a custom LLM baseline, rather than a generic benchmark, when evaluating LLM performance for your use case.

We’ll also show you how to use Okareo to create a custom LLM benchmark that focuses on the specific behaviors you need from an LLM in your application.

How LLM benchmarks work

An LLM benchmark is an evaluation framework for assessing LLM performance and may consist of a variety of metrics. One example of a popular LLM benchmark is the Massive Text Embedding Benchmark (MTEB). Hugging Face hosts an MTEB leaderboard, which ranks a number of LLMs from best to worst:

A screenshot of the MTEB leaderboard showing a number of LLM models and their benchmark scores.

This benchmark evaluates each model based on the average of its performance stats on different tasks, including:

  • Classification

  • Pair classification

  • Clustering

  • Reranking

  • Retrieval

  • Semantic textual similarity (STS)

  • Summarization

The tasks all use datasets, each of which is generally produced independently and for a different purpose. For example, one of the retrieval datasets used in the above benchmarks is Climate FEVER, a collection of climate-related claims, each labeled SUPPORTS, REFUTES, NOT_ENOUGH_INFO, or DISPUTED. Other retrieval datasets cover anything from an ontology of entities sourced from Wikipedia (such as companies, cities, and countries) to arguments that can be used in decision-making. Click on the “Retrieval” tab to view these datasets on the leaderboard.

Datasets for other tasks are similarly varied. For example, the classification datasets cover anything from the polarity of Amazon product reviews (positive or negative) to the domain of various tasks around news, recipes, reminders, and more.

As you can see, the evaluation criteria are extremely broad, covering a wide variety of tasks, and the leaderboard above ranks LLMs based on their performance across all these tasks and datasets combined.

The problem with standard LLM benchmarks: they only evaluate a “generic” use case

While generic benchmarks are useful for the development of general-purpose LLMs, they are significantly less useful if you are a developer looking to build LLM-powered functionality into your application. Even if the model bge-en-icl is ranked as the highest-performing model in the MTEB leaderboard above, that doesn’t necessarily mean that it will be the best for the kinds of tasks that you need to use an LLM for.

For example, even though that model ranks higher than the rest in the leaderboard, a different model may be better at providing more technical and detailed answers for, say, the automotive parts domain; or another model might be better at providing friendlier output that your end users prefer.

Some models may also be better at preventing security issues by disallowing prompts that can help malicious actors extract information from the system. This information isn’t reflected in the leaderboard.

If you select the wrong LLM for your use case, the impact on the end user experience may be significant. For example, you might struggle to get the LLM to produce relevant answers to customer queries, it may retrieve incomplete information, or you may run into security issues more often.

What to do instead of looking at standard benchmarks: custom baselines

Instead of selecting the LLM for your task based on high-level benchmarks, you should evaluate each model in a way that’s customized for your particular use case.

To achieve this, you need custom baselines as opposed to benchmarks. A baseline is a record of the LLM application’s past performance on a particular task, which you can compare against the performance of a newer version. Such a comparison is usually performed using custom LLM evaluations.

While doing a custom LLM evaluation might sound like a lot of work, at the core it consists of three straightforward concepts:

  1. Using your own data for evaluation rather than generic open-source datasets, or at least filtering or customizing the open-source datasets to your needs. This includes sample inputs for the LLM that are close to what you would expect the production inputs to be, and sample outputs that you expect to display to a user.

  2. Creating custom checks that evaluate the specific behavior that you’re looking for. The checks can include some or all of the broad measures if required, like the ones mentioned in the MTEB ranking, and incorporate more complex behaviors that are specific to your use case.

  3. Establishing a process where these custom checks run using your own custom data and where you compare the performance of the LLM application against the checks over time.

To evaluate an LLM using your own data, you need to create your own custom dataset that consists of pairs of sample inputs along with their expected results. In Okareo, we call these scenario sets. The scenarios may look different depending on the task that you’re going to be evaluating. For example, for generation tasks, an input-result pair may look as follows:

{
  "input": "How do you approach tea blending?",
  "result": "At TeaStore, we blend top-quality teas that normally would only be sold as single-origin varieties. We believe that by blending top-notch teas, we can create blends that taste even better than the blend’s components individually."
}

In general, the larger the dataset, the more complete and accurate the evaluation will be. However, with larger datasets comes higher evaluation cost, so if you are looking to evaluate frequently (which we recommend as part of the development process), you will need to find a sweet spot that gives you enough feedback on the quality of your LLM-backed system at a reasonable cost.

Most organizations can get started with evaluation for generation tasks at around a thousand data points. Okareo approaches the customer-supplied datasets as “seed” data and then generates additional data points based on variations of those using an LLM or a set of LLMs.
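If you already have a handful of representative input-result pairs, you can also register them as seed data directly from code instead of (or before) uploading a file. The snippet below is a minimal sketch, not taken from the cookbook: it assumes the okareo-ts-sdk npm package exposes a create_scenario_set method that accepts inline seed_data and a getProjects() helper, and the second tea example is made up for illustration, so verify the exact API against the SDK types.

// Minimal sketch: registering a small seed scenario set from code.
// Assumption: okareo-ts-sdk exposes create_scenario_set with inline seed_data
// and getProjects() for listing your projects.
import { Okareo } from "okareo-ts-sdk";

const okareo = new Okareo({ api_key: process.env.OKAREO_API_KEY! });
const project_id = (await okareo.getProjects())[0].id;

const seed_scenario = await okareo.create_scenario_set({
  name: "TeaStore seed scenarios",
  project_id: project_id,
  seed_data: [
    {
      input: "How do you approach tea blending?",
      result: "At TeaStore, we blend top-quality teas that normally would only be sold as single-origin varieties.",
    },
    {
      // Hypothetical second pair, added purely for illustration.
      input: "Do you offer decaffeinated blends?",
      result: "Yes, we offer several decaffeinated blends made from the same single-origin teas.",
    },
  ],
});

From a seed set like this, Okareo can then generate additional scenario variations for you, as described above.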

Picking helpful metrics for custom LLM benchmarking

Once you have custom data, including inputs and expected outputs, you need to decide how you’re going to distill the model’s performance on relevant tasks into a set of metrics that you can review at a glance.

For generation use cases, the range of metrics can be quite broad depending on what’s important to you, but in many cases, the core ones will be:

  • consistency

  • conciseness

  • relevance

  • BLEU score

In addition to this, you might go deeper and create more specific measures, such as friendliness of the message generated by the LLM or the correctness of technical details in its responses.
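To make the idea of a custom measure concrete, here is a small, purely illustrative TypeScript function (not an Okareo API) showing the kind of deterministic conciseness score you might later wrap in a check:

// Purely illustrative: a toy conciseness score computed locally.
// This is not an Okareo check definition; it only shows the kind of
// deterministic measure you might later turn into a custom check.
function concisenessScore(output: string, maxWords: number = 80): number {
  const wordCount = output.trim().split(/\s+/).filter(Boolean).length;
  // 1.0 when the response is at or under the word budget, shrinking
  // toward 0 as the response grows past it.
  return Math.min(1, maxWords / Math.max(wordCount, 1));
}

const sample =
  "At TeaStore, we blend top-quality teas that normally would only be sold as single-origin varieties.";
console.log(concisenessScore(sample)); // 1, since the response is well under 80 words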

How you can create a custom LLM benchmark with Okareo

Okareo is an LLM evaluation platform that helps you create custom benchmarks, track performance of models against those benchmarks over time, and create a feedback loop for AI development as a result.

Okareo offers a command-line tool, Python and TypeScript SDKs, a web-based user interface, and an API. Here’s how you can use Okareo to create a custom LLM benchmark for your use case using TypeScript. The process is very similar with the other approaches.

You can find the full example in the Okareo Cookbook repository on GitHub.
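The snippets in the steps below assume an okareo client instance, a project_id, and a few constants (MODEL_NAME, SCENARIO_SET_NAME, UNIQUE_BUILD_ID, OPENAI_API_KEY). As a rough sketch of that setup, assuming the okareo-ts-sdk npm package and an API key from your Okareo account (the constant values here are placeholders, and you should verify the client methods against the SDK documentation):

// tests/llm-evaluation.test.ts (setup portion -- a sketch, not cookbook code)
// Assumes the okareo-ts-sdk npm package; API keys are read from the environment.
import { Okareo } from "okareo-ts-sdk";

const OKAREO_API_KEY = process.env.OKAREO_API_KEY!;
const OPENAI_API_KEY = process.env.OPENAI_API_KEY!;
const MODEL_NAME = "Meeting Summarizer";        // placeholder name
const SCENARIO_SET_NAME = "Meeting Summaries";  // placeholder name
const UNIQUE_BUILD_ID = Date.now().toString();  // e.g. a CI build number

const okareo = new Okareo({ api_key: OKAREO_API_KEY });

// Assumption: getProjects() returns the projects visible to your API key;
// here we simply take the first project's id.
const project_id = (await okareo.getProjects())[0].id;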

Step 1: Register model with Okareo

Inside your test code, tell Okareo which model you would like to test.

// tests/llm-evaluation.test.ts
// The prompts for the model are stored in another file
import { prompts } from "../prompts/meeting_summary";
// SDK type used below; later snippets also use TestRunType, RunTestProps,
// CHECK_TYPE, CheckOutputType, and components from the same package --
// adjust the import list to match your version of okareo-ts-sdk.
import { OpenAIModel } from "okareo-ts-sdk";

// Register your model with Okareo
const model = await okareo.register_model({
  name: MODEL_NAME,
  project_id: project_id,
  models: {
    type: "openai",
    model_id: "gpt-4-turbo",
    temperature: 0.3,
    system_prompt_template: prompts.getCustomSystemPrompt(),
    user_prompt_template: prompts.getUserPromptTemplate(),
  } as OpenAIModel,
  update: true,
});

Step 2: Create a scenario set

Next, you’ll need to upload your custom data. You can do this using the Okareo SDK, the CLI, or the API, or in the Okareo user interface as shown below.

A screenshot of the Okareo interface showing the scenario creation page.

If you prefer the programmatic route, the data can be formatted as JSON, such as in this example JSONL file. You can then upload this data as follows:

const scenario: any = await okareo.upload_scenario_set({
  name: `${SCENARIO_SET_NAME} Scenario Set - ${UNIQUE_BUILD_ID}`,
  file_path: "./tests/meetings.jsonl",
  project_id: project_id,
});

You can see this code in context here.
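The JSONL file itself is just one JSON object per line. If you prefer to generate it from code, the sketch below uses Node’s fs module; the input/result field names mirror the generic scenario shape shown earlier, and the meeting content is made up for illustration (the cookbook’s meetings.jsonl may use different fields):

// Illustrative sketch: writing a scenario JSONL file (one JSON object per line).
// Field names follow the generic input/result scenario shape shown earlier;
// the meeting content is invented for illustration.
import { writeFileSync } from "node:fs";

const rows = [
  {
    input: "Transcript: Alice and Bob review the Q3 roadmap and agree to ship the beta in October.",
    result: "Alice and Bob agreed to ship the beta in October after reviewing the Q3 roadmap.",
  },
  {
    input: "Transcript: The team discusses onboarding feedback and assigns follow-up tasks.",
    result: "The team reviewed onboarding feedback and assigned follow-up tasks to owners.",
  },
];

writeFileSync(
  "./tests/meetings.jsonl",
  rows.map((row) => JSON.stringify(row)).join("\n") + "\n"
);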

Step 3: Create checks

Okareo offers a number of built-in checks that are based on standard metrics that most LLM benchmarks assess. This includes metrics like coherence, consistency, and fluency of generated output. All you need to do to use them is list them by name and later pass all checks to the evaluation function.

const checks = [
  "coherence",
  "consistency",
  "fluency",
];

Unlike standard LLM benchmarking, Okareo offers the ability to create your own custom checks, which evaluate behavior that’s very specific to your application. You define these checks in natural language, and Okareo converts them to code.

Some of these custom checks even allow you to use another LLM to evaluate the output of the LLM under test, just by supplying a natural language description and prompt. Here’s an example of a custom check that looks at whether the speakers in a meeting have a friendly tone:

const custom_checks: CHECK_TYPE[] = [{
  name: "demo.Tone.Friendly",
  description: "Use a model judgment to determine whether the tone in the meeting is friendly (true).",
  prompt: "Only output True if the speakers in the meeting are friendly; otherwise, return False.",
  output_data_type: CheckOutputType.PASS_FAIL,
}];

You can add your custom checks to your list of checks so they can all be passed to the evaluation together:

const checks = [
  "coherence",
  "consistency",
  "fluency",
  ...custom_checks.map(c => c.name),
];

Step 4: Run evaluation and compare to baseline

Once you’ve defined your scenario and your checks, you can run the evaluation using Okareo, passing your scenario set and checks to it:

const eval_run: components["schemas"]["TestRunItem"] = await model.run_test({
  model_api_key: OPENAI_API_KEY,
  name: `${MODEL_NAME} Eval`,
  project_id: project_id,
  scenario: scenario_set,
  calculate_metrics: true,
  type: TestRunType.NL_GENERATION,
  checks: checks
} as RunTestProps);

Each evaluation run produces a set of outputs that are visualized in the Okareo evaluation dashboard, as shown below. You can also access the evaluation run data using the SDK, the API, and the CLI if you prefer.

You can then compare the evaluation run results to the past results and decide whether the change that was being evaluated actually made the application better or worse for your needs.

A screenshot of the Okareo interface visualizing the outcomes of an evaluation run.
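To automate the comparison against your baseline, you might persist the aggregate scores from each run and fail a CI job when a metric regresses. The sketch below assumes the returned TestRunItem exposes aggregate per-check scores in a model_metrics field and keeps the baseline in a local JSON file; check the okareo-ts-sdk types for the exact response shape.

// Illustrative baseline comparison -- assumes eval_run exposes aggregate
// per-check scores in a model_metrics field (verify against the SDK types),
// and keeps the baseline in a local JSON file.
import { existsSync, readFileSync, writeFileSync } from "node:fs";

const BASELINE_PATH = "./tests/baseline.json";
const current: Record<string, number> = (eval_run as any).model_metrics ?? {};

if (existsSync(BASELINE_PATH)) {
  const baseline: Record<string, number> = JSON.parse(
    readFileSync(BASELINE_PATH, "utf-8")
  );
  for (const [check, baselineScore] of Object.entries(baseline)) {
    const latest = current[check];
    if (latest !== undefined && latest < baselineScore) {
      throw new Error(
        `Check "${check}" regressed: ${latest} is below the baseline of ${baselineScore}`
      );
    }
  }
}

// Once you're happy with a run, promote its scores to be the new baseline.
writeFileSync(BASELINE_PATH, JSON.stringify(current, null, 2));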

Sign up for Okareo to create your custom LLM benchmark

In this article, we’ve shown how you can use Okareo to create a custom LLM benchmark that focuses on the specific behaviors of an LLM that you need in your application. We recommend relying on such custom benchmarks over generic benchmarks that test models on generic inputs. Running a benchmark that uses your data, which includes your expected inputs and your expected outputs, has a much higher chance of helping you catch issues and improve your LLM application over time.

Using Okareo is more straightforward than creating all the required infrastructure yourself. You also get a user interface where you can see evaluation runs at a glance. You can use Okareo through its TypeScript and Python SDKs, and a CLI tool is also available for convenient use inside CI/CD workflows. In addition, all evaluation data is available via the Okareo API.

Sign up for Okareo today and give it a try, or book a demo to learn more.
