How to Validate the Output of LLM-Based Products in a Reproducible Way

CI and Automation

Matt Wyman, Co-founder of Okareo

Sarah Barber, Senior Technical Content Writer

July 20, 2024

Okareo is a tool for automating output validation for LLM-based products. Here, we show how to integrate it into your CI workflow.

Validating the output of your LLM-based products is essential if you want to be confident that your application will continue to work as expected, even as you change your prompts and the ways you interact with the model.

One way to validate the output of your LLM-based product is by manually checking that it produces acceptable responses when given a set of example user prompts.

While manual validation is essential during the first stages of development, it quickly becomes tedious and time-consuming, which tempts people to take shortcuts: skipping steps they think they can get away with, or being less thorough because they don't have the time to do their validation well.

Okareo is a tool for TypeScript or Python developers that allows you to automate the process of validating your LLM-powered apps, meaning this validation can then be run over and over in a reproducible way.

You can use Okareo for validating products that use either third-party LLMs (such as OpenAI’s models), LLMs you created from scratch, or LLMs that you've fine-tuned. However, in this article we focus on the use case of validating the output of products that use third-party LLMs.

Why you should validate the output of your LLM-based product

Whatever LLM-powered product you're making, trust is everything. If your application produces unreliable or questionable output, your users will lose trust in it and stop using it. Many AI companies have a quality assurance process in place to avoid this situation, and you may well be required to validate the LLM's output as part of this QA process. There are two types of changes that make validation necessary:

When the LLM itself changes: When it comes to the use of third-party LLMs, they can be updated at any moment, or their internal system prompts can change, and you have no control over this. As you don't know when these changes will happen, the only way to manage this type of validation is to schedule a regular validation of the LLM (for example, on an hourly or daily basis) with the hope of catching any changes early.

When your custom system prompts change: With third-party LLMs, you only really have control over the custom system prompts that your product sends to it. You can't control the LLM itself or how the user interacts with your product, but you can control the custom system prompts. Whenever these change, this risks a change in the LLM output, so validation is needed at this stage.

Depending on your application and the model it uses, your validation process might want to check the output of your LLM-powered app against a combination of metrics such as the following:

Accuracy: Check that the LLM isn’t producing plausible-sounding results that are factually wrong in response to your prompts.

Relevance: The output of the LLM should always be relevant to the input question.

Regulatory compliance: Certain types of data (such as healthcare or legal data) need to comply with relevant regulations (like HIPAA or OSHA). In these cases you should define the conformance rules that your app needs to comply with.

💡 Tip: In Okareo, you can use the "custom checks" feature to define your own conformance rules.

Correct format: LLMs can be prompted to produce output in a particular format (such as JSON or Markdown), and you can validate that the expected elements are present in every output. For example, if your product prompts an LLM to generate SDK documentation in Markdown, you could add validation checks stating that the output must contain both ::info blocks and script blocks (a minimal sketch of such a check appears after this list).

Semantic similarity: Some products, such as those that prompt the LLM to do text summarization, should produce output that is semantically similar to the input, and you can add checks to ensure this.

Filtering for inappropriate content: You can use validation to check that anything you don't want to be in the final output definitely isn't there. This could include anything from profanity and violence to illegal content.
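
To make the format check concrete, here is a minimal sketch of the kind of logic such a check could run. It is plain TypeScript rather than one of Okareo's built-in checks, and the function name and the specific ::info/script requirements are illustrative assumptions based on the example above.

// Hypothetical helper for a "correct format" check: verify that a Markdown
// string contains at least one ::info block and one fenced script block.
export function checkMarkdownFormat(output: string): boolean {
  const hasInfoBlock = /^::info\b/m.test(output);        // an "::info" admonition
  const hasScriptBlock = /```[\s\S]+?```/.test(output);  // a fenced code block
  return hasInfoBlock && hasScriptBlock;
}

// Example usage: a missing element means the format validation fails.
const sample = "::info\nUse the SDK like this:\n::\n```ts\nclient.init();\n```";
console.log(checkMarkdownFormat(sample)); // true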

The importance of making the validation of your LLM-based app output reproducible

Effective validation of the output of LLM-based apps isn't just a one-and-done thing. You need to set up a process of continuous validation, so that each time your custom system prompt changes, you automatically revalidate the output from the LLM.

A good time to perform output validation is as part of your CI workflow. This will ensure that every single change to your product (including the custom system prompts) always forces an automatic revalidation.

Until now, this has been challenging to implement, as most CI workflows are built with deterministic tools, so they're not very good at validating outputs of products that use LLMs, because LLMs tend to be non-deterministic. However, Okareo has created a solution to this problem: it allows you to automate output validation of LLM-based apps despite the fact that the outputs are non-deterministic.

Your Okareo code will be stored inside the codebase of your AI app — either in a folder called .okareo or as part of your testing suite, such as Jest. Any time you update your application code (including your custom system prompts), the associated pull request causes a CI workflow to be run, which runs your Okareo validation code, performing a number of validation checks on the output from the LLM. Once this validation passes, the code can be deployed to production.

How to validate the output of LLM-based apps in a reproducible way using Okareo

Let's explore how to use Okareo to validate the output of an app that uses an LLM, by choosing a well-known existing LLM — gpt-3.5-turbo on OpenAI. You'll need to be signed up with Okareo and OpenAI and have API keys for both. In this example, we'll be using the Okareo TypeScript SDK, but the Python SDK is also available.

Installation and configuration

Start by installing the Okareo CLI on your local machine. Follow the installation instructions, which include creating a directory structure like this:

[Project]   
  .okareo
    config.yml
    flows
      <your_flow_script>.ts

You'll need to specify TypeScript as your language in config.yml, and your Okareo code will be placed inside the flows directory. An Okareo flow is just a script that calls different parts of the Okareo API to set up and then validate an LLM.

Next, install dependencies for Okareo's TypeScript SDK and OpenAI's library. Finally, create environment variables for your API keys so they can be referenced in <your_flow_script>.ts and config.yml.
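
For reference, the top of your flow script might look roughly like the sketch below, which reads those environment variables and creates the Okareo client used in the later snippets. The okareo-ts-sdk package name and the constructor shape are assumptions here, so check the SDK documentation for the current API.

// <your_flow_script>.ts (sketch): read API keys from the environment and
// create the Okareo client used by the calls shown later in this article.
import { Okareo } from "okareo-ts-sdk"; // assumed package name

const OKAREO_API_KEY = process.env.OKAREO_API_KEY ?? "";
const OPENAI_API_KEY = process.env.OPENAI_API_KEY ?? "";
const MODEL_NAME = "Text Summarizer"; // any descriptive name for your model

if (!OKAREO_API_KEY || !OPENAI_API_KEY) {
  throw new Error("Missing OKAREO_API_KEY or OPENAI_API_KEY environment variable");
}

const okareo = new Okareo({ api_key: OKAREO_API_KEY }); // assumed constructor shape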

Prepare your product for validation

Before you can validate the output of your product, there are two key steps. First, you must register the model you’re using with Okareo, and then you must specify what the expected output of the LLM should be so there is something to compare the actual output against.

It's worth ensuring that your prompts are defined in a separate file so they can be easily referenced from your application and the Okareo flow file. For this use case of validating the output of LLM-powered products that use third-party LLMs, the only unit of change that you have control over is the custom system prompts — the tailored commands that your app sends to the LLM along with the user inputs.

Note: In the JSON file below, the {input} value of USER_PROMPT_TEMPLATE is a placeholder variable that will later be replaced with each individual input that you send to Okareo as part of a scenario, which is explained in detail below.

// prompts.json 
{
  "CUSTOM_SYSTEM_PROMPT": "You will be provided with text. Summarize the text in 1 simple sentence.",    
  "USER_PROMPT_TEMPLATE": "{input}" 
}

Register the model: This involves calling Okareo's register_model() function and passing it some parameters that explain the type of model you're registering, the temperature of the model (which controls the level of randomness of the output), and the context the model needs to deal with the prompts you'll send to it. (The project_id value used here is fetched in the next step, where we look up your Okareo project.)

// Import the prompts (to be used later)
import * as prompts from "../../prompts.json"
// Register your model with Okareo
const model = await okareo.register_model({
  name: MODEL_NAME,
  project_id: project_id,
  models: {
    type: "openai",
    model_id:"gpt-3.5-turbo",
    temperature:0.5,
    system_prompt_template:prompts.CUSTOM_SYSTEM_PROMPT,
    user_prompt_template:prompts.USER_PROMPT_TEMPLATE,
  } as OpenAIModel,
  update: true,
});

Specify the expected outputs from the LLM: You need a test dataset of possible input user prompts paired with corresponding acceptable results that could be output by the LLM for each input. This set of test data is called a scenario in Okareo.

The simplest type of scenario is one that is manually created by you. This is known as a seed scenario, as such scenarios can be used as seeds to generate other more complex scenarios. To create a seed scenario, first define your set of input data and expected results, then pass this data to Okareo's create_scenario_set().

// Define the scenario data 
const TEST_SEED_DATA = [
  SeedData({
    input:"WebBizz is dedicated to providing our customers with a seamless online shopping experience. Our platform is designed with user-friendly interfaces to help you browse and select the best products suitable for your needs. We offer a wide range of products from top brands and new entrants, ensuring diversity and quality in our offerings. Our 24/7 customer support is ready to assist you with any queries, from product details, shipping timelines, to payment methods. We also have a dedicated FAQ section addressing common concerns. Always ensure you are logged in to enjoy personalized product recommendations and faster checkout processes.",
    result:"WebBizz offers a user-friendly online shopping platform with diverse, quality products, 24/7 customer support, and personalized recommendations for a seamless experience."
  }),
  SeedData({
    input:"Safety and security of your data is our top priority at WebBizz. Our platform employs state-of-the-art encryption methods ensuring your personal and financial information remains confidential. Our two-factor authentication at checkout provides an added layer of security. We understand the importance of timely deliveries, hence we've partnered with reliable logistics partners ensuring your products reach you in pristine condition. In case of any delays or issues, our tracking tool can provide real-time updates on your product's location. We believe in transparency and guarantee no hidden fees or charges during your purchase journey.",
    result:"WebBizz prioritizes data security with advanced encryption and two-factor authentication, ensures timely deliveries with reliable logistics, provides real-time tracking, and guarantees no hidden fees."
  }),
  SeedData({
    input:"WebBizz places immense value on its dedicated clientele, recognizing their loyalty through the exclusive 'Premium Club' membership. This special program is designed to enrich the shopping experience, providing a suite of benefits tailored to our valued members. Among the advantages, members enjoy complimentary shipping, granting them a seamless and cost-effective way to receive their purchases. Additionally, the 'Premium Club' offers early access to sales, allowing members to avail themselves of promotional offers before they are opened to the general public.",
    result:"WebBizz values its loyal customers through the exclusive 'Premium Club' membership, offering benefits like complimentary shipping and early access to sales."
  })
];
// Get the ID of your Okareo project (which is needed to create a scenario 
// for your particular project). You can find your project name in the top 
// right of the Okareo app. 
const PROJECT_NAME = "Global";
const project: any[] = await okareo.getProjects();
const project_id = project.find(p => p.name === PROJECT_NAME)?.id; 
// Create the scenario 
const scenario: any = await okareo.create_scenario_set({
  name: "Webbizz Articles for Text Summarization Scenario Set",
  project_id: project_id,
  seed_data: TEST_SEED_DATA
});


If you have a very large amount of data for your scenario, you can upload a scenario set from a file using upload_scenario_set() instead, as sketched below.
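
As a rough sketch (and only a sketch: the parameter names, particularly file_path, are assumptions, so check the SDK documentation before relying on them), creating a scenario from a file could look like this:

// Sketch: create a scenario set from a local file instead of inline seed data.
const large_scenario: any = await okareo.upload_scenario_set({
  name: "Webbizz Articles for Text Summarization (from file)",
  file_path: "./webbizz_articles.jsonl", // hypothetical JSONL of {input, result} pairs
  project_id: project_id,
});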

Validate the output from the LLM

You can validate the output from the LLM by testing that it passes certain criteria and reporting on its success or failure.

Start by calling the run_test() function, which runs validation on any registered model. You should set the type to NL_GENERATION provided you're working with a natural language model. You'll also need to pass in some checks, which are the criteria by which you want to validate the model's output (in response to your custom system prompt and scenario). Okareo has a number of built-in checks that you can use out of the box by simply passing in their names, but it's also possible to create your own custom checks for validating anything you need.

const eval_run: components["schemas"]["TestRunItem"] = await model.run_test({
  model_api_key: OPENAI_API_KEY,
  name: `${MODEL_NAME} Eval`,
  project_id: project_id,
  scenario: scenario,
  calculate_metrics: true,
  type: TestRunType.NL_GENERATION,
  checks: [
    "coherence_summary",
    "consistency_summary",
    "fluency_summary",
    "relevance_summary"
  ]
} as RunTestProps);

Each check has a scale so you can determine how well your product is performing on each metric. For Okareo's pre-baked checks, that scale is often 1–5, but you can choose your own scale for custom checks.

When you run your evaluation, this will create a detailed online report on app.okareo.com, showing statistics on how well each output scored on each check.

Reporting to the command line whether the evaluation has passed

Okareo has reporter objects that can be used to take the statistics from the evaluation and decide whether the evaluation should pass or fail overall. As this example involves text generation, we use a GenerationReporter. You can log the report to your console using reporter.log() — this will print a link to a detailed online report on app.okareo.com.

In order to use the GenerationReporter, you need to set a threshold for each check that determines whether the check passes or fails. In the example below, there is a minimum threshold of 4.0 for each check.

const report_definition = {
  metrics_min: {
    "coherence": 4.0,
    "consistency": 4.0,
    "fluency": 4.0,
    "relevance": 4.0,
  }
};

These check thresholds are then passed to the Okareo reporter:

const reporter = new GenerationReporter({
  eval_run: eval_run,
  ...report_definition,
});
reporter.log();

Finally, there are two kinds of error handling you need to add to your Okareo code. The first is handling TypeScript runtime errors: put all your Okareo calls inside a try/catch block and deal with any errors there. The other is handling situations where there are no coding errors but the validation simply failed. For this, you just need to state what happens if the report did not pass:

if (!reporter.pass) {
  //handle error 
}
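
Putting both kinds of error handling together, a flow script skeleton along these lines (a sketch, assuming a Node environment where process.exit is available) makes a failed validation exit with a non-zero status, which CI systems treat as a failure. Here, runValidation() is a hypothetical wrapper standing in for the register_model, create_scenario_set, run_test and GenerationReporter code shown above:

// Sketch: run the whole flow inside main() so both kinds of failure are handled.
const runValidation = async (): Promise<boolean> => {
  // ... Okareo calls from the sections above go here ...
  return true; // replace with `reporter.pass`
};

const main = async () => {
  try {
    const passed = await runValidation();
    if (!passed) {
      console.error("Okareo validation failed: a check fell below its threshold.");
      process.exit(1); // a non-zero exit code fails the CI job
    }
  } catch (error) {
    // Runtime errors (bad API keys, network issues, SDK errors) also fail the run
    console.error("Okareo flow error:", error);
    process.exit(1);
  }
};

main();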

All this code together can make up one Okareo flow script, and the full code example of this can be found on our GitHub account. To run it, simply use the okareo run command, and you will be able to see if your validation passed or failed.

Adding an Okareo LLM validation to your CI workflow

As long as you have an Okareo flow script file that validates output from LLM-based products, with proper error handling in place, you're ready to begin integrating this into your CI workflow.

The first step is to add your API keys as environment variables in your CI provider. Once this is done, you'll need to create a CI workflow file that installs Okareo and then runs your flow script whenever a push or pull request happens on the main branch of your version control. You can follow our step-by-step guide on integrating Okareo into GitHub Actions for more details on this.

Adding Okareo to your CI workflow increases confidence in your LLM-based products

Using Okareo to automatically validate the output of your product's LLM whenever your custom system prompts change means you can be more confident that your LLM-based product is working as expected. It also means you can stop manually validating the LLM's output, which is boring, time-consuming and error-prone.

With the time you get back, you can focus your energy on building new and better features, and once you integrate Okareo into your CI workflow, your development speed will increase and you'll be able to deploy new changes much faster.

Okareo is a tool for automating output validation for LLM-based products. Here, we show how to integrate it into your CI workflow.

Validating the output of your LLM-based products is essential if you want to be confident that your application will continue to work as expected, even as you change your prompts and the ways you interact with the model.

One way to validate the output of your LLM-based product is by manually checking that it produces acceptable responses when given a set of example user prompts.

While manual validation is essential during the first stages of development, it unfortunately becomes tedious and time-consuming after a while, which tempts people to start taking shortcuts — skipping any steps they think they can get away with, or being less thorough because they don't have the time to do their validation well.

Okareo is a tool for TypeScript or Python developers that allows you to automate the process of validating your LLM-powered apps, meaning this validation can then be run over and over in a reproducible way.

You can use Okareo for validating products that use either third-party LLMs (such as OpenAI’s models), LLMs you created from scratch, or LLMs that you've fine-tuned. However, in this article we focus on the use case of validating the output of products that use third-party LLMs.

Why you should validate the output of your LLM-based product

Whatever LLM-powered product you're making, trust is everything. If your application produces unreliable or questionable output, your users will lose trust in it and stop using it. Many AI companies will have a quality assurance process in place to avoid this situation, and you may well be required to add validation that takes into account the LLM as part of this QA process. There are two types of changes that would make validation necessary:

When the LLM itself changes: When it comes to the use of third-party LLMs, they can be updated at any moment, or their internal system prompts can change, and you have no control over this. As you don't know when these changes will happen, the only way to manage this type of validation is to schedule a regular validation of the LLM (for example, on an hourly or daily basis) with the hope of catching any changes early.

When your custom system prompts change: With third-party LLMs, you only really have control over the custom system prompts that your product sends to it. You can't control the LLM itself or how the user interacts with your product, but you can control the custom system prompts. Whenever these change, this risks a change in the LLM output, so validation is needed at this stage.

Depending on your application and the model it uses, your validation process might want to check the output of your LLM-powered app against a combination of metrics such as the following:

Accuracy: Check that the LLM isn’t producing plausible-sounding results that are factually wrong in response to your prompts.

Relevance: The output of the LLM should always be relevant to the input question.

Regulatory compliance: Certain types of data (such as healthcare data or legal regulations) need to comply with relevant regulations (like HIPAA or OSHA). In these cases you should define the conformance rules that your app needs to comply with.

💡 Tip: In Okareo, you can use the "custom checks" feature to define your own conformance rules.

Correct format: LLMs can be prompted to produce output in a particular format (such as JSON or Markdown) and you can validate that the expected elements are present in all outputs. For example, if your product is prompting an LLM to generate SDK documentation in Markdown, you could use some validation checks that state that the LLM should output both ::info blocks and script blocks within the Markdown.

Semantic similarity: Some products, such as those that prompt the LLM to do text summarization, should produce output that is semantically similar to the input, and you can add checks to ensure this.

Filtering for inappropriate content: You can use validation to check that anything you don't want to be in the final output definitely isn't there. This could include anything from profanity and violence to illegal content.

The importance of making the validation of your LLM-based app output reproducible

Effective validation of the output of LLM-based apps isn't just a one-and-done thing. You need to set up a process of continuous validation, so that each time your custom system prompt changes, you automatically revalidate the output from the LLM.

A good time to perform output validation is as part of your CI workflow. This will ensure that every single change to your product (including the custom system prompts) always forces an automatic revalidation.

Until now, this has been challenging to implement, as most CI workflows are built with deterministic tools, so they're not very good at validating outputs of products that use LLMs, because LLMs tend to be non-deterministic. However, Okareo has created a solution to this problem: it allows you to automate output validation of LLM-based apps despite the fact that the outputs are non-deterministic.

Your Okareo code will be stored inside the codebase of your AI app — either in a folder called .okareo or as part of your testing suite, such as Jest. Any time you update your application code (including your custom system prompts), the associated pull request causes a CI workflow to be run, which runs your Okareo validation code, performing a number of validation checks on the output from the LLM. Once this validation passes, the code can be deployed to production.

How to validate the output of LLM-based apps in a reproducible way using Okareo

Let's explore how to use Okareo to validate the output of an app that uses an LLM, by choosing a well-known existing LLM — gpt-3.5-turbo on OpenAI. You'll need to be signed up with Okareo and OpenAI and have API keys for both. In this example, we'll be using the Okareo TypeScript SDK, but the Python SDK is also available.

Installation and configuration

Start by installing the Okareo CLI on your local machine. Follow the instructions, which includes creating a directory structure like this:

[Project]   
  .okareo
    config.yml
    flows
      <your_flow_script>.ts

You'll need to specify TypeScript as your language in config.yml and our Okareo code will be placed inside the flows directory. An Okareo flow is just a script that calls different parts of the Okareo API for the purpose of setting up and then validating an LLM.

Next, install dependencies for Okareo's TypeScript SDK and OpenAI's library. Finally, create environment variables for your API keys so they can be referenced in <your_flow_script>.ts and config.yml.

Prepare your product for validation

Before you can validate the output of your product, there are two key steps. First, you must register the model you’re using with Okareo, and then you must specify what the expected output of the LLM should be so there is something to compare the actual output against.

It's worth ensuring that your prompts are defined in a separate file so they can be easily referenced from your application and the Okareo flow file. For this use case of validating the output of LLM-powered products that use third-party LLMs, the only unit of change that you have control over is the custom system prompts — the tailored commands that your app sends to the LLM along with the user inputs.

Note: In the JSON file below, the {input} value of USER_PROMPT_TEMPLATE is a placeholder variable that will later be replaced with each individual input that you send to Okareo as part of a scenario, which is explained in detail below.

// prompts.json 
{
  "CUSTOM_SYSTEM_PROMPT": "You will be provided with text. Summarize the text in 1 simple sentence.",    
  "USER_PROMPT_TEMPLATE": "{input}" 
}

Register the model: This involves calling Okareo's register_model() function and passing it some parameters that explain the type of model you're registering, the temperature of the model (which controls the level of randomness of the output), and the context the model needs to deal with the prompts you'll send to it.

// Import the prompts (to be used later)
import * as prompts from "../../prompts.json"
// Register your model with Okareo
const model = await okareo.register_model({
  name: MODEL_NAME,
  project_id: project_id,
  models: {
    type: "openai",
    model_id:"gpt-3.5-turbo",
    temperature:0.5,
    system_prompt_template:prompts.CUSTOM_SYSTEM_PROMPT,
    user_prompt_template:prompts.USER_PROMPT_TEMPLATE,
  } as OpenAIModel,
  update: true,
});

Specify the expected outputs from the LLM: You need a test dataset of possible input user prompts paired with corresponding acceptable results that could be output by the LLM for each input. This set of test data is called a scenario in Okareo.

The simplest type of scenario is one that is manually created by you. This is known as a seed scenario, as such scenarios can be used as seeds to generate other more complex scenarios. To create a seed scenario, first define your set of input data and expected results, then pass this data to Okareo's create_scenario_set().

// Define the scenario data 
const TEST_SEED_DATA = [
  SeedData({
    input:"WebBizz is dedicated to providing our customers with a seamless online shopping experience. Our platform is designed with user-friendly interfaces to help you browse and select the best products suitable for your needs. We offer a wide range of products from top brands and new entrants, ensuring diversity and quality in our offerings. Our 24/7 customer support is ready to assist you with any queries, from product details, shipping timelines, to payment methods. We also have a dedicated FAQ section addressing common concerns. Always ensure you are logged in to enjoy personalized product recommendations and faster checkout processes.",
    result:"WebBizz offers a user-friendly online shopping platform with diverse, quality products, 24/7 customer support, and personalized recommendations for a seamless experience."
  }),
  SeedData({
    input:"Safety and security of your data is our top priority at WebBizz. Our platform employs state-of-the-art encryption methods ensuring your personal and financial information remains confidential. Our two-factor authentication at checkout provides an added layer of security. We understand the importance of timely deliveries, hence we've partnered with reliable logistics partners ensuring your products reach you in pristine condition. In case of any delays or issues, our tracking tool can provide real-time updates on your product's location. We believe in transparency and guarantee no hidden fees or charges during your purchase journey.",
    result:"WebBizz prioritizes data security with advanced encryption and two-factor authentication, ensures timely deliveries with reliable logistics, provides real-time tracking, and guarantees no hidden fees."
  }),
  SeedData({
    input:"WebBizz places immense value on its dedicated clientele, recognizing their loyalty through the exclusive 'Premium Club' membership. This special program is designed to enrich the shopping experience, providing a suite of benefits tailored to our valued members. Among the advantages, members enjoy complimentary shipping, granting them a seamless and cost-effective way to receive their purchases. Additionally, the 'Premium Club' offers early access to sales, allowing members to avail themselves of promotional offers before they are opened to the general public.",
    result:"WebBizz values its loyal customers through the exclusive 'Premium Club' membership, offering benefits like complimentary shipping and early access to sales."
  })
];
// Get the ID of your Okareo project (which is needed to create a scenario 
// for your particular project). You can find your project name in the top 
// right of the Okareo app. 
const PROJECT_NAME = "Global";
const project: any[] = await okareo.getProjects();
const project_id = project.find(p => p.name === PROJECT_NAME)?.id; 
// Create the scenario 
const scenario: any = await okareo.create_scenario_set({
  name: "Webbizz Articles for Text Summarization Scenario Set",
  project_id: project_id,
  seed_data: TEST_SEED_DATA
});


If you have a very large amount of data for your scenario, you can upload a scenario set from a file using uploadscenarioset() instead.

Validate the output from the LLM

You can validate the output from the LLM by testing that it passes certain criteria and reporting on its success or failure.

Start by calling the run_test() function, which runs validation on any registered model. You should set the type to NL_GENERATION provided you're working with a natural language model. You'll also need to pass in some checks, which are the criteria by which you want to validate the model's output (in response to your custom system prompt and scenario). Okareo has a number of built-in checks that you can use out of the box by simply passing in their names, but it's also possible to create your own custom checks for validating anything you need.

const eval_run: components["schemas"]["TestRunItem"] = await model.run_test({
  model_api_key: OPENAI_API_KEY,
  name: `${MODEL_NAME} Eval`,
  project_id: project_id,
  scenario: scenario,
  calculate_metrics: true,
  type: TestRunType.NL_GENERATION,
  checks: [
    "coherence_summary",
    "consistency_summary",
    "fluency_summary",
    "relevance_summary"
  ]
} as RunTestProps);

Each check has a scale so you can determine how well your product is performing on each metric. For Okareo's pre-baked checks, that scale is often 1–5, but you can choose your own scale for custom checks.

When you run your evaluation, this will create a detailed online report on app.okareo.com, showing statistics on how well each output scored on each check.

Reporting to the command line whether the evaluation has passed

Okareo has reporter objects that can be used to take the statistics from the evaluation and decide whether the evaluation should pass or fail overall. As this example involves text generation, we use a GenerationReporter. You can log the report to your console using reporter.log() — this will print a link to a detailed online report on app.okareo.com.

In order to use the GenerationReporter, you need to set a threshold for each check that determines whether the check passes or fails. In the example below, there is a minimum threshold of 4.0 for each check.

const report_definition = {
  metrics_min: {
    "coherence": 4.0,
    "consistency": 4.0,
    "fluency": 4.0,
    "relevance": 4.0,
  }
};

These check thresholds are then passed to the Okareo reporter:

const reporter = new GenerationReporter({
  eval_run :eval_run,
  ...report_definition,
}); reporter.log();

Finally, there are two kinds of error handling you need to add to your Okareo code. The first is to handle TypeScript runtime errors by putting all your Okareo calls inside a try/catch block and find a way to handle any errors inside there. The other is to handle situations where there are no coding errors but the validation simply failed. For this, you just need to state what happens if the report did not pass:

if (!reporter.pass) {
  //handle error 
}

All this code together can make up one Okareo flow script, and the full code example of this can be found on our GitHub account. To run it, simply use the okareo run command, and you will be able to see if your validation passed or failed.

Adding an Okareo LLM validation to your CI workflow

As long as you have an Okareo flow script file that validates output from LLM-based products, with proper error handling in place, you're ready to begin integrating this into your CI workflow.

The first step is to add your API keys as environment variables in your CI provider. Once this is done, you'll need to create a CI workflow file that installs Okareo and then runs your flow script whenever a push or pull request happens on the main branch of your version control. You can follow our step-by-step guide on integrating Okareo into GitHub Actions for more details on this.

Adding Okareo to your CI workflow increases confidence in your LLM-based products

Using Okareo to automate the validation of your product's LLM's output whenever your custom system prompts change means you can be more confident that your LLM-based product is working as expected. It also means you can stop doing manual validation of the output from the LLM, which is boring, time-consuming and error-prone.

With the time you'll get back, you can focus your energy on building new and better models, and once you integrate Okareo into your CI workflow, your development speed will increase massively, and you'll be able to get new changes deployed much faster.

Okareo is a tool for automating output validation for LLM-based products. Here, we show how to integrate it into your CI workflow.

Validating the output of your LLM-based products is essential if you want to be confident that your application will continue to work as expected, even as you change your prompts and the ways you interact with the model.

One way to validate the output of your LLM-based product is by manually checking that it produces acceptable responses when given a set of example user prompts.

While manual validation is essential during the first stages of development, it unfortunately becomes tedious and time-consuming after a while, which tempts people to start taking shortcuts — skipping any steps they think they can get away with, or being less thorough because they don't have the time to do their validation well.

Okareo is a tool for TypeScript or Python developers that allows you to automate the process of validating your LLM-powered apps, meaning this validation can then be run over and over in a reproducible way.

You can use Okareo for validating products that use either third-party LLMs (such as OpenAI’s models), LLMs you created from scratch, or LLMs that you've fine-tuned. However, in this article we focus on the use case of validating the output of products that use third-party LLMs.

Why you should validate the output of your LLM-based product

Whatever LLM-powered product you're making, trust is everything. If your application produces unreliable or questionable output, your users will lose trust in it and stop using it. Many AI companies will have a quality assurance process in place to avoid this situation, and you may well be required to add validation that takes into account the LLM as part of this QA process. There are two types of changes that would make validation necessary:

When the LLM itself changes: When it comes to the use of third-party LLMs, they can be updated at any moment, or their internal system prompts can change, and you have no control over this. As you don't know when these changes will happen, the only way to manage this type of validation is to schedule a regular validation of the LLM (for example, on an hourly or daily basis) with the hope of catching any changes early.

When your custom system prompts change: With third-party LLMs, you only really have control over the custom system prompts that your product sends to it. You can't control the LLM itself or how the user interacts with your product, but you can control the custom system prompts. Whenever these change, this risks a change in the LLM output, so validation is needed at this stage.

Depending on your application and the model it uses, your validation process might want to check the output of your LLM-powered app against a combination of metrics such as the following:

Accuracy: Check that the LLM isn’t producing plausible-sounding results that are factually wrong in response to your prompts.

Relevance: The output of the LLM should always be relevant to the input question.

Regulatory compliance: Certain types of data (such as healthcare data or legal regulations) need to comply with relevant regulations (like HIPAA or OSHA). In these cases you should define the conformance rules that your app needs to comply with.

💡 Tip: In Okareo, you can use the "custom checks" feature to define your own conformance rules.

Correct format: LLMs can be prompted to produce output in a particular format (such as JSON or Markdown) and you can validate that the expected elements are present in all outputs. For example, if your product is prompting an LLM to generate SDK documentation in Markdown, you could use some validation checks that state that the LLM should output both ::info blocks and script blocks within the Markdown.

Semantic similarity: Some products, such as those that prompt the LLM to do text summarization, should produce output that is semantically similar to the input, and you can add checks to ensure this.

Filtering for inappropriate content: You can use validation to check that anything you don't want to be in the final output definitely isn't there. This could include anything from profanity and violence to illegal content.

The importance of making the validation of your LLM-based app output reproducible

Effective validation of the output of LLM-based apps isn't just a one-and-done thing. You need to set up a process of continuous validation, so that each time your custom system prompt changes, you automatically revalidate the output from the LLM.

A good time to perform output validation is as part of your CI workflow. This will ensure that every single change to your product (including the custom system prompts) always forces an automatic revalidation.

Until now, this has been challenging to implement, as most CI workflows are built with deterministic tools, so they're not very good at validating outputs of products that use LLMs, because LLMs tend to be non-deterministic. However, Okareo has created a solution to this problem: it allows you to automate output validation of LLM-based apps despite the fact that the outputs are non-deterministic.

Your Okareo code will be stored inside the codebase of your AI app — either in a folder called .okareo or as part of your testing suite, such as Jest. Any time you update your application code (including your custom system prompts), the associated pull request causes a CI workflow to be run, which runs your Okareo validation code, performing a number of validation checks on the output from the LLM. Once this validation passes, the code can be deployed to production.

How to validate the output of LLM-based apps in a reproducible way using Okareo

Let's explore how to use Okareo to validate the output of an app that uses an LLM, by choosing a well-known existing LLM — gpt-3.5-turbo on OpenAI. You'll need to be signed up with Okareo and OpenAI and have API keys for both. In this example, we'll be using the Okareo TypeScript SDK, but the Python SDK is also available.

Installation and configuration

Start by installing the Okareo CLI on your local machine. Follow the instructions, which includes creating a directory structure like this:

[Project]   
  .okareo
    config.yml
    flows
      <your_flow_script>.ts

You'll need to specify TypeScript as your language in config.yml and our Okareo code will be placed inside the flows directory. An Okareo flow is just a script that calls different parts of the Okareo API for the purpose of setting up and then validating an LLM.

Next, install dependencies for Okareo's TypeScript SDK and OpenAI's library. Finally, create environment variables for your API keys so they can be referenced in <your_flow_script>.ts and config.yml.

Prepare your product for validation

Before you can validate the output of your product, there are two key steps. First, you must register the model you’re using with Okareo, and then you must specify what the expected output of the LLM should be so there is something to compare the actual output against.

It's worth ensuring that your prompts are defined in a separate file so they can be easily referenced from your application and the Okareo flow file. For this use case of validating the output of LLM-powered products that use third-party LLMs, the only unit of change that you have control over is the custom system prompts — the tailored commands that your app sends to the LLM along with the user inputs.

Note: In the JSON file below, the {input} value of USER_PROMPT_TEMPLATE is a placeholder variable that will later be replaced with each individual input that you send to Okareo as part of a scenario, which is explained in detail below.

// prompts.json 
{
  "CUSTOM_SYSTEM_PROMPT": "You will be provided with text. Summarize the text in 1 simple sentence.",    
  "USER_PROMPT_TEMPLATE": "{input}" 
}

Register the model: This involves calling Okareo's register_model() function and passing it some parameters that explain the type of model you're registering, the temperature of the model (which controls the level of randomness of the output), and the context the model needs to deal with the prompts you'll send to it.

// Import the prompts (to be used later)
import * as prompts from "../../prompts.json"
// Register your model with Okareo
const model = await okareo.register_model({
  name: MODEL_NAME,
  project_id: project_id,
  models: {
    type: "openai",
    model_id:"gpt-3.5-turbo",
    temperature:0.5,
    system_prompt_template:prompts.CUSTOM_SYSTEM_PROMPT,
    user_prompt_template:prompts.USER_PROMPT_TEMPLATE,
  } as OpenAIModel,
  update: true,
});

Specify the expected outputs from the LLM: You need a test dataset of possible input user prompts paired with corresponding acceptable results that could be output by the LLM for each input. This set of test data is called a scenario in Okareo.

The simplest type of scenario is one that is manually created by you. This is known as a seed scenario, as such scenarios can be used as seeds to generate other more complex scenarios. To create a seed scenario, first define your set of input data and expected results, then pass this data to Okareo's create_scenario_set().

// Define the scenario data 
const TEST_SEED_DATA = [
  SeedData({
    input:"WebBizz is dedicated to providing our customers with a seamless online shopping experience. Our platform is designed with user-friendly interfaces to help you browse and select the best products suitable for your needs. We offer a wide range of products from top brands and new entrants, ensuring diversity and quality in our offerings. Our 24/7 customer support is ready to assist you with any queries, from product details, shipping timelines, to payment methods. We also have a dedicated FAQ section addressing common concerns. Always ensure you are logged in to enjoy personalized product recommendations and faster checkout processes.",
    result:"WebBizz offers a user-friendly online shopping platform with diverse, quality products, 24/7 customer support, and personalized recommendations for a seamless experience."
  }),
  SeedData({
    input:"Safety and security of your data is our top priority at WebBizz. Our platform employs state-of-the-art encryption methods ensuring your personal and financial information remains confidential. Our two-factor authentication at checkout provides an added layer of security. We understand the importance of timely deliveries, hence we've partnered with reliable logistics partners ensuring your products reach you in pristine condition. In case of any delays or issues, our tracking tool can provide real-time updates on your product's location. We believe in transparency and guarantee no hidden fees or charges during your purchase journey.",
    result:"WebBizz prioritizes data security with advanced encryption and two-factor authentication, ensures timely deliveries with reliable logistics, provides real-time tracking, and guarantees no hidden fees."
  }),
  SeedData({
    input:"WebBizz places immense value on its dedicated clientele, recognizing their loyalty through the exclusive 'Premium Club' membership. This special program is designed to enrich the shopping experience, providing a suite of benefits tailored to our valued members. Among the advantages, members enjoy complimentary shipping, granting them a seamless and cost-effective way to receive their purchases. Additionally, the 'Premium Club' offers early access to sales, allowing members to avail themselves of promotional offers before they are opened to the general public.",
    result:"WebBizz values its loyal customers through the exclusive 'Premium Club' membership, offering benefits like complimentary shipping and early access to sales."
  })
];
// Get the ID of your Okareo project (which is needed to create a scenario 
// for your particular project). You can find your project name in the top 
// right of the Okareo app. 
const PROJECT_NAME = "Global";
const project: any[] = await okareo.getProjects();
const project_id = project.find(p => p.name === PROJECT_NAME)?.id; 
// Create the scenario 
const scenario: any = await okareo.create_scenario_set({
  name: "Webbizz Articles for Text Summarization Scenario Set",
  project_id: project_id,
  seed_data: TEST_SEED_DATA
});


If you have a very large amount of data for your scenario, you can upload a scenario set from a file using uploadscenarioset() instead.

Validate the output from the LLM

You can validate the output from the LLM by testing that it passes certain criteria and reporting on its success or failure.

Start by calling the run_test() function, which runs validation on any registered model. You should set the type to NL_GENERATION provided you're working with a natural language model. You'll also need to pass in some checks, which are the criteria by which you want to validate the model's output (in response to your custom system prompt and scenario). Okareo has a number of built-in checks that you can use out of the box by simply passing in their names, but it's also possible to create your own custom checks for validating anything you need.

const eval_run: components["schemas"]["TestRunItem"] = await model.run_test({
  model_api_key: OPENAI_API_KEY,
  name: `${MODEL_NAME} Eval`,
  project_id: project_id,
  scenario: scenario,
  calculate_metrics: true,
  type: TestRunType.NL_GENERATION,
  checks: [
    "coherence_summary",
    "consistency_summary",
    "fluency_summary",
    "relevance_summary"
  ]
} as RunTestProps);

Each check has a scale so you can determine how well your product is performing on each metric. For Okareo's pre-baked checks, that scale is often 1–5, but you can choose your own scale for custom checks.

When you run your evaluation, this will create a detailed online report on app.okareo.com, showing statistics on how well each output scored on each check.

Reporting to the command line whether the evaluation has passed

Okareo has reporter objects that can be used to take the statistics from the evaluation and decide whether the evaluation should pass or fail overall. As this example involves text generation, we use a GenerationReporter. You can log the report to your console using reporter.log() — this will print a link to a detailed online report on app.okareo.com.

In order to use the GenerationReporter, you need to set a threshold for each check that determines whether the check passes or fails. In the example below, there is a minimum threshold of 4.0 for each check.

const report_definition = {
  metrics_min: {
    "coherence": 4.0,
    "consistency": 4.0,
    "fluency": 4.0,
    "relevance": 4.0,
  }
};

These check thresholds are then passed to the Okareo reporter:

const reporter = new GenerationReporter({
  eval_run :eval_run,
  ...report_definition,
}); reporter.log();

Finally, there are two kinds of error handling you need to add to your Okareo code. The first is to handle TypeScript runtime errors by putting all your Okareo calls inside a try/catch block and find a way to handle any errors inside there. The other is to handle situations where there are no coding errors but the validation simply failed. For this, you just need to state what happens if the report did not pass:

if (!reporter.pass) {
  //handle error 
}

All this code together can make up one Okareo flow script, and the full code example of this can be found on our GitHub account. To run it, simply use the okareo run command, and you will be able to see if your validation passed or failed.

Adding an Okareo LLM validation to your CI workflow

As long as you have an Okareo flow script file that validates output from LLM-based products, with proper error handling in place, you're ready to begin integrating this into your CI workflow.

The first step is to add your API keys as environment variables in your CI provider. Once this is done, you'll need to create a CI workflow file that installs Okareo and then runs your flow script whenever a push or pull request happens on the main branch of your version control. You can follow our step-by-step guide on integrating Okareo into GitHub Actions for more details on this.

Adding Okareo to your CI workflow increases confidence in your LLM-based products

Using Okareo to automate the validation of your product's LLM's output whenever your custom system prompts change means you can be more confident that your LLM-based product is working as expected. It also means you can stop doing manual validation of the output from the LLM, which is boring, time-consuming and error-prone.

With the time you'll get back, you can focus your energy on building new and better models, and once you integrate Okareo into your CI workflow, your development speed will increase massively, and you'll be able to get new changes deployed much faster.

Okareo is a tool for automating output validation for LLM-based products. Here, we show how to integrate it into your CI workflow.

Validating the output of your LLM-based products is essential if you want to be confident that your application will continue to work as expected, even as you change your prompts and the ways you interact with the model.

One way to validate the output of your LLM-based product is by manually checking that it produces acceptable responses when given a set of example user prompts.

While manual validation is essential during the first stages of development, it unfortunately becomes tedious and time-consuming after a while, which tempts people to start taking shortcuts — skipping any steps they think they can get away with, or being less thorough because they don't have the time to do their validation well.

Okareo is a tool for TypeScript or Python developers that allows you to automate the process of validating your LLM-powered apps, meaning this validation can then be run over and over in a reproducible way.

You can use Okareo for validating products that use either third-party LLMs (such as OpenAI’s models), LLMs you created from scratch, or LLMs that you've fine-tuned. However, in this article we focus on the use case of validating the output of products that use third-party LLMs.

Why you should validate the output of your LLM-based product

Whatever LLM-powered product you're making, trust is everything. If your application produces unreliable or questionable output, your users will lose trust in it and stop using it. Many AI companies will have a quality assurance process in place to avoid this situation, and you may well be required to add validation that takes into account the LLM as part of this QA process. There are two types of changes that would make validation necessary:

When the LLM itself changes: When it comes to the use of third-party LLMs, they can be updated at any moment, or their internal system prompts can change, and you have no control over this. As you don't know when these changes will happen, the only way to manage this type of validation is to schedule a regular validation of the LLM (for example, on an hourly or daily basis) with the hope of catching any changes early.

When your custom system prompts change: With third-party LLMs, you only really have control over the custom system prompts that your product sends to it. You can't control the LLM itself or how the user interacts with your product, but you can control the custom system prompts. Whenever these change, this risks a change in the LLM output, so validation is needed at this stage.

Depending on your application and the model it uses, your validation process might want to check the output of your LLM-powered app against a combination of metrics such as the following:

Accuracy: Check that the LLM isn’t producing plausible-sounding results that are factually wrong in response to your prompts.

Relevance: The output of the LLM should always be relevant to the input question.

Regulatory compliance: Certain types of data (such as healthcare data or legal regulations) need to comply with relevant regulations (like HIPAA or OSHA). In these cases you should define the conformance rules that your app needs to comply with.

💡 Tip: In Okareo, you can use the "custom checks" feature to define your own conformance rules.

Correct format: LLMs can be prompted to produce output in a particular format (such as JSON or Markdown) and you can validate that the expected elements are present in all outputs. For example, if your product is prompting an LLM to generate SDK documentation in Markdown, you could use some validation checks that state that the LLM should output both ::info blocks and script blocks within the Markdown.

Semantic similarity: Some products, such as those that prompt the LLM to do text summarization, should produce output that is semantically similar to the input, and you can add checks to ensure this.

Filtering for inappropriate content: You can use validation to check that anything you don't want to be in the final output definitely isn't there. This could include anything from profanity and violence to illegal content.

The importance of making the validation of your LLM-based app output reproducible

Effective validation of the output of LLM-based apps isn't just a one-and-done thing. You need to set up a process of continuous validation, so that each time your custom system prompt changes, you automatically revalidate the output from the LLM.

A good time to perform output validation is as part of your CI workflow. This will ensure that every single change to your product (including the custom system prompts) always forces an automatic revalidation.

Until now, this has been challenging to implement, as most CI workflows are built with deterministic tools, so they're not very good at validating outputs of products that use LLMs, because LLMs tend to be non-deterministic. However, Okareo has created a solution to this problem: it allows you to automate output validation of LLM-based apps despite the fact that the outputs are non-deterministic.

Your Okareo code will be stored inside the codebase of your AI app — either in a folder called .okareo or as part of your testing suite, such as Jest. Any time you update your application code (including your custom system prompts), the associated pull request causes a CI workflow to be run, which runs your Okareo validation code, performing a number of validation checks on the output from the LLM. Once this validation passes, the code can be deployed to production.

How to validate the output of LLM-based apps in a reproducible way using Okareo

Let's explore how to use Okareo to validate the output of an app that uses an LLM, by choosing a well-known existing LLM — gpt-3.5-turbo on OpenAI. You'll need to be signed up with Okareo and OpenAI and have API keys for both. In this example, we'll be using the Okareo TypeScript SDK, but the Python SDK is also available.

Installation and configuration

Start by installing the Okareo CLI on your local machine. Follow the instructions, which includes creating a directory structure like this:

[Project]   
  .okareo
    config.yml
    flows
      <your_flow_script>.ts

You'll need to specify TypeScript as your language in config.yml, and your Okareo code will be placed inside the flows directory. An Okareo flow is simply a script that calls different parts of the Okareo API to set up and then validate an LLM.

Next, install dependencies for Okareo's TypeScript SDK and OpenAI's library. Finally, create environment variables for your API keys so they can be referenced in <your_flow_script>.ts and config.yml.
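As a reference point, the top of a flow script might look something like the sketch below. The environment variable names and the exact list of SDK imports are assumptions (they mirror the symbols used later in this article), so adjust them to match your own setup and the okareo-ts-sdk documentation:

// <your_flow_script>.ts: setup at the top of the flow script
// Assumes the dependencies have been installed, e.g.: npm install okareo-ts-sdk openai
import {
  Okareo,
  OpenAIModel,
  SeedData,
  TestRunType,
  RunTestProps,
  GenerationReporter,
  components,
} from "okareo-ts-sdk";

// API keys read from environment variables (names assumed here; they should
// match whatever you reference in config.yml)
const OKAREO_API_KEY = process.env.OKAREO_API_KEY!;
const OPENAI_API_KEY = process.env.OPENAI_API_KEY!;

// Okareo client used by the rest of the flow script
const okareo = new Okareo({ api_key: OKAREO_API_KEY });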

Prepare your product for validation

Before you can validate the output of your product, there are two key steps. First, you must register the model you’re using with Okareo, and then you must specify what the expected output of the LLM should be so there is something to compare the actual output against.

It's worth ensuring that your prompts are defined in a separate file so they can be easily referenced from your application and the Okareo flow file. For this use case of validating the output of LLM-powered products that use third-party LLMs, the only unit of change that you have control over is the custom system prompts — the tailored commands that your app sends to the LLM along with the user inputs.

Note: In the JSON file below, the {input} value of USER_PROMPT_TEMPLATE is a placeholder variable that will later be replaced with each individual input that you send to Okareo as part of a scenario, which is explained in detail below.

// prompts.json 
{
  "CUSTOM_SYSTEM_PROMPT": "You will be provided with text. Summarize the text in 1 simple sentence.",    
  "USER_PROMPT_TEMPLATE": "{input}" 
}
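For context, the application side might use the same prompts file in roughly the following way when calling OpenAI. This sketch uses the standard OpenAI Node client and sits in your app code, not in the Okareo flow; the summarize() helper is purely illustrative:

// How the application itself could reuse prompts.json when calling OpenAI
import OpenAI from "openai";
import * as prompts from "./prompts.json";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function summarize(userText: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    temperature: 0.5,
    messages: [
      { role: "system", content: prompts.CUSTOM_SYSTEM_PROMPT },
      // The {input} placeholder is filled with the actual user text
      { role: "user", content: prompts.USER_PROMPT_TEMPLATE.replace("{input}", userText) },
    ],
  });
  return completion.choices[0].message.content ?? "";
}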

Register the model: This involves calling Okareo's register_model() function and passing it some parameters that describe the type of model you're registering, the temperature of the model (which controls the level of randomness of the output), and the prompt templates that give the model the context it needs to handle the inputs you'll send to it.

// Import the prompts (to be used later)
import * as prompts from "../../prompts.json"

// A display name for the registered model, reused later when running the evaluation
const MODEL_NAME = "Text Summarization Model";

// Register your model with Okareo (project_id is looked up from your Okareo
// project in the next code block)
const model = await okareo.register_model({
  name: MODEL_NAME,
  project_id: project_id,
  models: {
    type: "openai",
    model_id: "gpt-3.5-turbo",
    temperature: 0.5,
    system_prompt_template: prompts.CUSTOM_SYSTEM_PROMPT,
    user_prompt_template: prompts.USER_PROMPT_TEMPLATE,
  } as OpenAIModel,
  update: true,
});

Specify the expected outputs from the LLM: You need a test dataset of possible input user prompts paired with corresponding acceptable results that could be output by the LLM for each input. This set of test data is called a scenario in Okareo.

The simplest type of scenario is one that is manually created by you. This is known as a seed scenario, as such scenarios can be used as seeds to generate other more complex scenarios. To create a seed scenario, first define your set of input data and expected results, then pass this data to Okareo's create_scenario_set().

// Define the scenario data 
const TEST_SEED_DATA = [
  SeedData({
    input:"WebBizz is dedicated to providing our customers with a seamless online shopping experience. Our platform is designed with user-friendly interfaces to help you browse and select the best products suitable for your needs. We offer a wide range of products from top brands and new entrants, ensuring diversity and quality in our offerings. Our 24/7 customer support is ready to assist you with any queries, from product details, shipping timelines, to payment methods. We also have a dedicated FAQ section addressing common concerns. Always ensure you are logged in to enjoy personalized product recommendations and faster checkout processes.",
    result:"WebBizz offers a user-friendly online shopping platform with diverse, quality products, 24/7 customer support, and personalized recommendations for a seamless experience."
  }),
  SeedData({
    input:"Safety and security of your data is our top priority at WebBizz. Our platform employs state-of-the-art encryption methods ensuring your personal and financial information remains confidential. Our two-factor authentication at checkout provides an added layer of security. We understand the importance of timely deliveries, hence we've partnered with reliable logistics partners ensuring your products reach you in pristine condition. In case of any delays or issues, our tracking tool can provide real-time updates on your product's location. We believe in transparency and guarantee no hidden fees or charges during your purchase journey.",
    result:"WebBizz prioritizes data security with advanced encryption and two-factor authentication, ensures timely deliveries with reliable logistics, provides real-time tracking, and guarantees no hidden fees."
  }),
  SeedData({
    input:"WebBizz places immense value on its dedicated clientele, recognizing their loyalty through the exclusive 'Premium Club' membership. This special program is designed to enrich the shopping experience, providing a suite of benefits tailored to our valued members. Among the advantages, members enjoy complimentary shipping, granting them a seamless and cost-effective way to receive their purchases. Additionally, the 'Premium Club' offers early access to sales, allowing members to avail themselves of promotional offers before they are opened to the general public.",
    result:"WebBizz values its loyal customers through the exclusive 'Premium Club' membership, offering benefits like complimentary shipping and early access to sales."
  })
];
// Get the ID of your Okareo project (which is needed to create a scenario 
// for your particular project). You can find your project name in the top 
// right of the Okareo app. 
const PROJECT_NAME = "Global";
const project: any[] = await okareo.getProjects();
const project_id = project.find(p => p.name === PROJECT_NAME)?.id; 
// Create the scenario 
const scenario: any = await okareo.create_scenario_set({
  name: "Webbizz Articles for Text Summarization Scenario Set",
  project_id: project_id,
  seed_data: TEST_SEED_DATA
});


If you have a very large amount of data for your scenario, you can upload a scenario set from a file using upload_scenario_set() instead.
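As a rough sketch, an upload could look like the following. The file path is illustrative and the parameter names are assumptions modelled on the create_scenario_set() call above, so verify them against the SDK documentation:

// Upload a large scenario set from a file of input/result pairs
// (parameter names assumed; check the okareo-ts-sdk docs)
const uploaded_scenario: any = await okareo.upload_scenario_set({
  name: "Webbizz Articles for Text Summarization Scenario Set",
  file_path: "./webbizz_articles.jsonl",
  project_id: project_id,
});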

Validate the output from the LLM

You can validate the output from the LLM by testing that it passes certain criteria and reporting on its success or failure.

Start by calling the run_test() function, which runs validation on any registered model. Set the type to NL_GENERATION, since you're working with a natural language generation model. You'll also need to pass in some checks, which are the criteria by which you want to validate the model's output (in response to your custom system prompt and scenario). Okareo has a number of built-in checks that you can use out of the box by simply passing in their names, but it's also possible to create your own custom checks for validating anything you need.

const eval_run: components["schemas"]["TestRunItem"] = await model.run_test({
  model_api_key: OPENAI_API_KEY,
  name: `${MODEL_NAME} Eval`,
  project_id: project_id,
  scenario: scenario,
  calculate_metrics: true,
  type: TestRunType.NL_GENERATION,
  checks: [
    "coherence_summary",
    "consistency_summary",
    "fluency_summary",
    "relevance_summary"
  ]
} as RunTestProps);

Each check has a scale so you can determine how well your product is performing on each metric. For Okareo's pre-baked checks, that scale is often 1–5, but you can choose your own scale for custom checks.

When you run your evaluation, this will create a detailed online report on app.okareo.com, showing statistics on how well each output scored on each check.

Reporting to the command line whether the evaluation has passed

Okareo has reporter objects that take the statistics from the evaluation and decide whether the evaluation should pass or fail overall. As this example involves text generation, we use a GenerationReporter. You can log the report to your console using reporter.log(), which also prints a link to the detailed online report on app.okareo.com.

In order to use the GenerationReporter, you need to set a threshold for each check that determines whether the check passes or fails. In the example below, there is a minimum threshold of 4.0 for each check.

const report_definition = {
  metrics_min: {
    "coherence": 4.0,
    "consistency": 4.0,
    "fluency": 4.0,
    "relevance": 4.0,
  }
};

These check thresholds are then passed to the Okareo reporter:

const reporter = new GenerationReporter({
  eval_run: eval_run,
  ...report_definition,
});
reporter.log();

Finally, there are two kinds of error handling you need to add to your Okareo code. The first handles TypeScript runtime errors: wrap all your Okareo calls in a try/catch block and deal with any errors that are caught there. The second handles the case where there are no coding errors but the validation itself has failed. For this, you just need to state what happens if the report did not pass:

if (!reporter.pass) {
  // Handle the validation failure, for example by throwing an error so that
  // the flow (and any CI job running it) exits with a failure status
}
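Putting the two together, a flow script can be structured roughly as follows. The main() wrapper and the error messages are illustrative rather than anything required by the SDK:

// Illustrative flow script structure with both kinds of error handling
const main = async () => {
  try {
    // ... register the model, create the scenario, call run_test() and build the reporter ...
    reporter.log();
    if (!reporter.pass) {
      // No runtime errors, but one or more checks fell below their thresholds
      throw new Error("Okareo validation failed: check thresholds not met");
    }
  } catch (error) {
    // Runtime errors (bad API keys, network problems, etc.) also fail the run
    console.error(error);
    process.exit(1);
  }
};
main();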

All this code together can make up one Okareo flow script, and the full code example of this can be found on our GitHub account. To run it, simply use the okareo run command, and you will be able to see if your validation passed or failed.

Adding an Okareo LLM validation to your CI workflow

As long as you have an Okareo flow script file that validates output from LLM-based products, with proper error handling in place, you're ready to begin integrating this into your CI workflow.

The first step is to add your API keys as environment variables (or secrets) in your CI provider. Once this is done, you'll need to create a CI workflow file that installs Okareo and then runs your flow script whenever a push or pull request happens on the main branch of your repository. You can follow our step-by-step guide on integrating Okareo into GitHub Actions for more details; a rough outline of such a workflow is sketched below.
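For example, a GitHub Actions workflow along the following lines would run the flow on every push and pull request to main. The secret names are assumptions and the Okareo install step is a placeholder, so use the installation command from the Okareo CLI documentation and our GitHub Actions guide:

# .github/workflows/okareo-validation.yml (sketch)
name: okareo-validation
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  validate-llm-output:
    runs-on: ubuntu-latest
    env:
      OKAREO_API_KEY: ${{ secrets.OKAREO_API_KEY }}
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # Install the Okareo CLI here (see the Okareo install docs for the exact command),
      # then run the flow script(s) in .okareo/flows
      - run: okareo run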

Adding Okareo to your CI workflow increases confidence in your LLM-based products

Using Okareo to automate the validation of your product's LLM's output whenever your custom system prompts change means you can be more confident that your LLM-based product is working as expected. It also means you can stop doing manual validation of the output from the LLM, which is boring, time-consuming and error-prone.

With the time you get back, you can focus your energy on building new features and better models. And once Okareo is part of your CI workflow, your development speed will increase and you'll be able to get new changes deployed faster.
