How to add LLM Evaluation to your CI workflow

Evaluation

Matt Wyman, Co-founder of Okareo

Sarah Barber, Senior Technical Content Writer

June 13, 2024

Okareo is a large language model (LLM) evaluation tool that allows you to integrate LLM evaluation into your CI workflow. We show how to get Okareo working in your CI.

When building a product that uses third-party LLMs (such as those available from OpenAI), it's important to regularly verify that these LLMs still work as expected, especially when you update a model or other parts of your software. However, testing LLMs isn't as straightforward as testing regular application code.

Application code testing can be easily automated by creating unit tests and running them as part of your continuous integration (CI) pipelines, so that every time your code gets updated, the tests are re-run by the CI system against the new code. But because LLMs have non-deterministic outputs, it's impossible to write conventional unit tests for the parts of your app that interact with LLMs, as the output can change each time you run the test. Until now, manual testing has been the only real way to ensure the accuracy of LLMs as they continue to develop.

Okareo allows you to skip tedious manual testing by offering a way to automate tests for non-deterministic systems like LLMs and other AI models. Using Okareo, it's now possible to integrate your automated LLM evaluations into your CI workflow.

What is LLM evaluation?

LLM evaluation is a way to measure how good an LLM is at performing certain tasks — such as text completion, summarization, or question answering. It assesses the performance of an LLM by comparing the model's output against some expected results to check things like accuracy and relevancy.

Without LLM evaluation, you won't be able to check that your model can be generalized beyond the training data you've given it. As LLMs are now being integrated into app development, it's important to have confidence in your model at all times – hence the need to add an LLM evaluation step to your CI workflow.
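
To make the idea concrete, here is a toy sketch (not Okareo's API, just an illustration of the principle) in which a model's actual output is scored against an expected result and compared to a threshold. The scoring function and threshold value here are hypothetical stand-ins for real metrics such as relevance or consistency.

// Toy illustration of LLM evaluation: score the actual output against an expected
// result and apply a pass threshold. Real evaluations use richer metrics
// (embedding similarity, LLM-as-judge scores, etc.) rather than word overlap.
type EvalCase = { input: string; expected: string };

// Hypothetical scoring function: fraction of expected words found in the actual output (0..1)
function score(expected: string, actual: string): number {
  const words = expected.toLowerCase().split(/\s+/);
  const hits = words.filter((w) => actual.toLowerCase().includes(w)).length;
  return hits / words.length;
}

// The evaluation passes only if every case meets the minimum score
function evaluate(cases: EvalCase[], generate: (input: string) => string, minScore = 0.8): boolean {
  return cases.every((c) => score(c.expected, generate(c.input)) >= minScore);
}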

What is Okareo?

Okareo is a tool for developers that evaluates the output of the LLMs powering your AI apps. You can write tests that evaluate your LLM in either TypeScript or Python, and it has a CLI for running all your LLM evaluations, meaning it can be easily integrated into your CI workflow, no matter which CI provider you use.

Okareo's LLM evaluation process involves comparing a scenario (consisting of inputs to a model along with their corresponding expected results) with the actual results of an LLM given the same inputs.

Okareo treats the LLM as a black box — it's simply interested in whether the output data of the LLM conforms to certain rules or standards, known as checks. Checks are a way to compare the similarity of the expected and actual results according to specific metrics like consistency or relevance.

You can either use Okareo's pre-defined checks or create your own custom ones. Some checks are measured as a simple pass or fail, and others are scored with a range (such as 1-5). For those with a range, you can decide the threshold at which it passes or fails. For example, you might choose to set the minimum pass threshold of relevance to 4 out of 5. By default, all checks must pass in order for your LLM evaluation CI pipeline to pass overall, although this can be overridden with a property called [error_max](https://okareo.com/docs/sdk/okareo_typescript), which specifies how many errors you’re willing to tolerate and still pass.
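
As a rough mental model (a hedged sketch only; Okareo's real report_definition and error_max usage appear later in this tutorial), deciding overall pass or fail from scored checks might look like this:

// Conceptual sketch of combining scored checks into an overall pass/fail decision.
// Check names, score scales, and the errorMax behaviour here are illustrative assumptions.
type CheckResult = { name: string; score: number }; // e.g. scores on a 1-5 scale

function evaluationPasses(
  results: CheckResult[],
  minScores: Record<string, number>, // minimum score per check, e.g. { relevance: 4 }
  errorMax = 0                       // number of failing checks you are willing to tolerate
): boolean {
  const failures = results.filter((r) => r.score < (minScores[r.name] ?? 0)).length;
  return failures <= errorMax;
}

// Example: a relevance score of 3.6 against a minimum of 4 only passes if errorMax is at least 1
evaluationPasses([{ name: "relevance", score: 3.6 }], { relevance: 4 }, 1); // true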

Flow diagram showing the processes that happen in Okareo as part of LLM evaluation.

How to use Okareo to evaluate an LLM-powered application or feature in CI

The main steps involved are to create an LLM, write evaluation tests for the LLM using Okareo's TypeScript or Python libraries, and finally run those tests as part of a CI workflow using the Okareo CLI. To follow along with this tutorial, there is no need to create your own LLM as we'll be evaluating an existing one (gpt-3.5-turbo on OpenAI).

First, you'll need to sign up to Okareo and OpenAI and get API keys for each account.

This example uses the Okareo TypeScript SDK, but Python is also available.

Running an Okareo flow locally

An Okareo flow is a TypeScript or Python script that calls different parts of the Okareo API to complete an Okareo evaluation. Below are the instructions you'll need to get a flow running locally. We'll be using a text summarization model for our example, and we’ve shared the entire project on GitHub with you to make it easier to follow along.
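
Before diving into the individual steps, it helps to see the overall shape of a flow script. The sketch below is a minimal skeleton assuming the import names and client constructor used by the snippets later in this tutorial; the model and scenario names are placeholders you can change, and the GitHub example project remains the authoritative reference.

// Minimal skeleton of flows/<your_flow_script>.ts (a sketch; see the full example project on GitHub)
import {
  Okareo,
  RunTestProps,
  TestRunType,
  OpenAIModel,
  SeedData,
  GenerationReporter,
  components,
} from "okareo-ts-sdk";
import * as core from "@actions/core";

// API keys are read from environment variables (set in a later step)
const OKAREO_API_KEY = process.env.OKAREO_API_KEY || "";
const OPENAI_API_KEY = process.env.OPENAI_API_KEY || "";

// Placeholder names used by the snippets in the steps below
const MODEL_NAME = "Text Summarizer";
const SCENARIO_SET_NAME = "Text Summarization";

// Okareo client used throughout the following steps
const okareo = new Okareo({ api_key: OKAREO_API_KEY });

const main = async () => {
  try {
    // The snippets in the numbered steps below go here:
    // register the model, create the scenario, run the evaluation, report the results.
  } catch (error: any) {
    core.setFailed("CI failed because of an error calling Okareo: " + error.message);
  }
};
main();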

  1. Install the Okareo CLI: Install this first on your local machine so you can test your Okareo evaluations locally before integrating them into your CI. The installation instructions describe how to set Okareo's language to TypeScript and create a file structure for your Okareo flows. Once you've finished following these instructions you should have a directory structure as follows:

    [Project]   
      .okareo
        config.yml
        flows
          <your_flow_script>.ts
    
    
  2. Install dependencies for Okareo and OpenAI: You can install the required packages with

    npm install okareo-ts-sdk openai


  3. Set environment variables: Next, set your API keys and Okareo project ID as environment variables (OKAREO_API_KEY, OPENAI_API_KEY, and OKAREO_PROJECT_ID) so they can be used in <your_flow_script>.ts and config.yml.

  4. Register your model with Okareo: In this case, we're using OpenAI's gpt-3.5-turbo model. For this example, we're asking the model to summarize any text that is sent to it into a single sentence.

// Define the system prompt to be sent to the OpenAI model
const SUMMARIZATION_CONTEXT_TEMPLATE = "You will be provided with text. Summarize the text in 1 simple sentence.";
// Define the user prompt to be sent to the OpenAI model
// "{input}" is a placeholder variable that will later be replaced with
// the input from each item in your scenario
const USER_PROMPT_TEMPLATE = "{input}";
// Use a UNIQUE_BUILD_ID which can be used to ensure that the names or tags
// of anything you create are unique to every CI run
const UNIQUE_BUILD_ID = (process.env.DEMO_BUILD_ID || `local.${(Math.random() + 1).toString(36).substring(7)}`);
// Register your model with Okareo. MODEL_NAME is any descriptive name you choose,
// and project_id is retrieved from Okareo as shown in the next step.
const model = await okareo.register_model({
  name: MODEL_NAME,
  tags: [`Build:${UNIQUE_BUILD_ID}`],
  project_id: project_id,
  models: {
    type: "openai",
    model_id: "gpt-3.5-turbo",
    temperature: 0.5,
    system_prompt_template: SUMMARIZATION_CONTEXT_TEMPLATE,
    user_prompt_template: USER_PROMPT_TEMPLATE,
  } as OpenAIModel,
  update: true,
});

  5. Create your scenario: Your scenario is a set of inputs and expected results. Scenarios can either be created manually (known as a seed scenario) or generated from previous scenarios. In this example, each expected result is the ID of a reference document that has already been manually verified as an acceptable result for the input data.

const TEST_SEED_DATA = [
  SeedData({
    input:"WebBizz is dedicated to providing our customers with a seamless online shopping experience. Our platform is designed with user-friendly interfaces to help you browse and select the best products suitable for your needs. We offer a wide range of products from top brands and new entrants, ensuring diversity and quality in our offerings. Our 24/7 customer support is ready to assist you with any queries, from product details, shipping timelines, to payment methods. We also have a dedicated FAQ section addressing common concerns. Always ensure you are logged in to enjoy personalized product recommendations and faster checkout processes.", 
    result:"75eaa363-dfcc-499f-b2af-1407b43cb133"
  }),
  SeedData({
    input:"Safety and security of your data is our top priority at WebBizz. Our platform employs state-of-the-art encryption methods ensuring your personal and financial information remains confidential. Our two-factor authentication at checkout provides an added layer of security. We understand the importance of timely deliveries, hence we've partnered with reliable logistics partners ensuring your products reach you in pristine condition. In case of any delays or issues, our tracking tool can provide real-time updates on your product's location. We believe in transparency and guarantee no hidden fees or charges during your purchase journey.",
    result:"ac0d464c-f673-44b8-8195-60c965e47525"
  }),
  SeedData({
    input:"WebBizz places immense value on its dedicated clientele, recognizing their loyalty through the exclusive 'Premium Club' membership. This special program is designed to enrich the shopping experience, providing a suite of benefits tailored to our valued members. Among the advantages, members enjoy complimentary shipping, granting them a seamless and cost-effective way to receive their purchases. Additionally, the 'Premium Club' offers early access to sales, allowing members to avail themselves of promotional offers before they are opened to the general public.",
    result:"aacf7a34-9d3a-4e2a-9a5c-91f2a0e8a12d"
  }) 
]; 
// Get the ID of your Okareo project (which is needed to create a scenario 
// for your particular project). You can find your project name in the top 
// right of the Okareo app. 
const PROJECT_NAME = "Global"; 
const project: any[] = await okareo.getProjects(); 
const project_id = project.find(p => p.name === PROJECT_NAME)?.id; 
// Create the scenario 
const scenario: any = await okareo.create_scenario_set({
  name: `${SCENARIO_SET_NAME} Scenario Set - ${UNIQUE_BUILD_ID}`,
  project_id: project_id,
  seed_data: TEST_SEED_DATA
});

Tip: Alternatively, you can upload a scenario set from a file using upload_scenario_set().

  6. Run the LLM evaluation: Call Okareo's run_test() function, remembering to set the type to NL_GENERATION, since you're testing a natural language generation model (Okareo also supports testing other types of models). Also pass in any checks that you want to be run at this stage.

const eval_run: components["schemas"]["TestRunItem"] =
  await model.run_test({
    model_api_key: OPENAI_API_KEY,
    name: `${MODEL_NAME} Eval ${UNIQUE_BUILD_ID}`,
    tags: [`Build:${UNIQUE_BUILD_ID}`],
    project_id: project_id,
    scenario: scenario,
    calculate_metrics: true,
    type: TestRunType.NL_GENERATION,
    checks: [
      "coherence_summary",
      "consistency_summary",
      "fluency_summary",
      "relevance_summary"
    ]
  } as RunTestProps);

  7. Set the thresholds for your checks: These are needed for reporting purposes, to help determine whether your evaluation was a success. A common way to do this is to set a minimum threshold that each metric must meet (on average) for the evaluation to pass.

const report_definition = {
  metrics_min: {
    "coherence": 4.0,
    "consistency": 4.0,
    "fluency": 4.0,
    "relevance": 4.0,
  }
};

  8. Set up reporting: Okareo's GenerationReporter gives you statistics on how well your evaluation performed against each check metric and whether it passed overall.

    Using reporter.log() will log the details of your success or failure and print a link to a detailed online report, but you can also check whether the evaluation succeeded by testing whether reporter.pass is true.

const reporter = new GenerationReporter({
  eval_run: eval_run,
  ...report_definition,
});
reporter.log();

  9. Handle evaluation failures and other errors: There are two types of problem that you need to make sure your code can handle:

    1. Okareo reports that an evaluation did not pass: Since we'll later be running this code in GitHub Actions, we can use the GitHub Actions core library (@actions/core) to report the failure, which works both locally and in GitHub Actions.

      import * as core from "@actions/core";
      if (!reporter.pass) {
        core.setFailed("CI failed because the Okareo reporter failed.");
      }
      
      
    2. TypeScript runtime errors: It's standard practice to handle runtime errors by wrapping your code in a try/catch block. You can continue to use the GitHub Actions core library to report the error.

    try {     
      // TypeScript code that calls Okareo 
    } catch (error) {
      core.setFailed("CI failed because of an error calling Okareo: "+error.message);
    }
  10. Run your Okareo flow script: On your local machine, run the command below, where text_summarization is the name of the flow script used in this tutorial's example project.

okareo run -f text_summarization

This will output reporting data to the CLI from the reporter.log() function, indicating whether the evaluation has passed or failed, and printing a link that takes you to a reporting page in Okareo where you can visualize your results more easily.

Screenshot of command line output for a successful "okareo run" command.

Screenshot of command line output for a failed "okareo run" command.

Screenshot of the Okareo reporting page for a text summarization LLM evaluation.

Integrating Okareo into your CI workflow

Now that you've got your Okareo flow running locally, you can add it to your CI workflow. To run Okareo in CI, you will need to install the Okareo CLI on your CI server, set up your API keys as CI environment variables, and add your okareo run command to your CI workflow.

Here, we show how to do this in GitHub Actions, but the principles generalize easily to other CI providers like CircleCI, Bitbucket Pipelines, or GitLab CI/CD.

Make sure you already have a GitHub repo with your Okareo project in it (the .okareo directory should be at the top level of the project).
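
For reference, the repository used in this tutorial ends up with roughly the following layout (assuming the flow script is named text_summarization.ts, which matches the flow name used in the CI command below):

    [Project]
      .okareo
        config.yml
        flows
          text_summarization.ts
      .github
        workflows
          main.yml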

Add secrets as environment variables: On your repository's main page, click Settings > Secrets and variables > Actions, and then New repository secret. Add secrets for OKAREO_API_KEY, OKAREO_PROJECT_ID and OPENAI_API_KEY.

Set up a workflow file: Inside your repo, click the Actions tab, then click Set up a workflow yourself. This will create a GitHub Actions workflow config file at .github/workflows/main.yml.

Add the following code to your main.yml file. The Okareo action installs the Okareo CLI before Okareo tries to run your evaluation.

name: Text summarization Okareo flow
env:
  DEMO_BUILD_ID: ${{ github.run_number }}
  OKAREO_API_KEY: ${{ secrets.OKAREO_API_KEY }}
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  OKAREO_PROJECT_ID: ${{ secrets.OKAREO_PROJECT_ID }}

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

jobs:
  text-summarization:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: .
    permissions:
      contents: 'read'
      id-token: 'write'
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Okareo Action
        uses: okareo-ai/okareo-action@v2.5

      - name: Text Summarization Evaluation
        run: |
          okareo -v
          okareo run -f text_summarization

Now commit your workflow file. Committing it to the main branch will trigger your GitHub Actions workflow to run (as will any subsequent push or pull request to your main branch).

Screenshot of the LLM evaluation running in a CI workflow (GitHub Actions), with each step checked off to show a successful run.

Okareo automates the process of LLM evaluation

Start evaluating LLM applications in CI with Okareo and ensure that your LLM-powered functionality works as expected while also saving you countless hours of manual testing. Okareo is free to use for smaller projects (of up to 5k model data points, 1k evaluated rows and 50k scenario tokens).

You can get started with Okareo immediately by signing up and installing the software. The example evaluation in this tutorial can help you get started, as well as our cookbook project which contains many working examples.
