How to add LLM Evaluation to your CI workflow

Evaluation

Matt Wyman
Co-founder of Okareo
Sarah Barber
Senior Technical Content Writer
June 13, 2024
Okareo is a large language model (LLM) evaluation tool that allows you to integrate LLM evaluation into your CI workflow. We show how to get Okareo working in your CI.
When building a product that uses third-party LLMs (such as those available from OpenAI), it's important to regularly verify that these LLMs still behave as expected, especially when you update a model or other parts of your software. However, testing LLMs isn't as straightforward as testing regular application code.
Application code testing can be easily automated by creating unit tests and running them as part of your continuous integration (CI) pipelines, so that every time your code gets updated, the tests are re-run by the CI system against the new code. But because LLMs produce non-deterministic outputs, you can't write conventional unit tests for the parts of your app that interact with them, as the output can change each time you run the test. Until now, manual testing has been the only real way to ensure the accuracy of LLM-powered features as they evolve.
Okareo allows you to skip tedious work on manual tests by offering a way to automate tests for non-deterministic systems like LLMs and other AI models. Using Okareo it's now possible to integrate your automated LLM evaluations into your CI workflow.
What is LLM evaluation?
LLM evaluation is a way to measure how good an LLM is at performing certain tasks — such as text completion, summarization, or question answering. It assesses the performance of an LLM by comparing the model's output against some expected results to check things like accuracy and relevancy.
Without LLM evaluation, you won't be able to check that your model can be generalized beyond the training data you've given it. As LLMs are now being integrated into app development, it's important to have confidence in your model at all times – hence the need to add an LLM evaluation step to your CI workflow.
What is Okareo?
Okareo is a tool for developers that can evaluate the output of the LLMs that power your AI apps. You can write tests that evaluate your LLM in either TypeScript or Python, and it has a CLI interface for running all your LLM evaluations, meaning it can be easily integrated into your CI workflow, no matter which CI provider you use.
Okareo's LLM evaluation process involves comparing a scenario (consisting of inputs to a model along with their corresponding expected results) with the actual results of an LLM given the same inputs.
Okareo treats the LLM as a black box — it's simply interested in whether the output data of the LLM conforms to certain rules or standards, known as checks. Checks are a way to compare the similarity of the expected and actual results according to specific metrics like consistency or relevance.
You can either use Okareo's pre-defined checks or create your own custom ones. Some checks are measured as a simple pass or fail, and others are scored with a range (such as 1-5). For those with a range, you can decide the threshold at which it passes or fails. For example, you might choose to set the minimum pass threshold of relevance to 4 out of 5. By default, all checks must pass in order for your LLM evaluation CI pipeline to pass overall, although this can be overridden with a property called [error_max](https://okareo.com/docs/sdk/okareo_typescript), which specifies how many errors you’re willing to tolerate and still pass.

How to use Okareo to evaluate an LLM-powered application or feature in CI
The main steps involved are to choose an LLM, write evaluation tests for the LLM using Okareo's TypeScript or Python libraries, and finally run those tests as part of a CI workflow using the Okareo CLI. To follow along with this tutorial, there is no need to create your own LLM, as we'll be evaluating an existing one (gpt-3.5-turbo from OpenAI).
First, you'll need to sign up to Okareo and OpenAI and get API keys for each account.
This example uses the Okareo TypeScript SDK, but Python is also available.
Running an Okareo flow locally
An Okareo flow is a TypeScript or Python script that calls different parts of the Okareo API to complete an Okareo evaluation. Below are the instructions you'll need to get a flow running locally. We'll be using a text summarization model for our example, and we’ve shared the entire project on GitHub with you to make it easier to follow along.
Install the Okareo CLI: Install this first on your local machine so you can test your Okareo evaluations locally before integrating them into your CI. The installation instructions describe how to set Okareo's language to TypeScript and create a file structure for your Okareo flows. Once you've finished following these instructions you should have a directory structure as follows:
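The layout below reflects what the CLI's TypeScript setup produced at the time of writing; treat the exact file names as illustrative and compare against what the install instructions actually generate:

```
.okareo/
├── config.yml                  # CLI settings, including language: typescript
└── flows/
    └── <your_flow_script>.ts   # your Okareo evaluation flow lives here
```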
Install dependencies for Okareo and OpenAI: Install the npm packages for the Okareo TypeScript SDK and the OpenAI client.
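For example, assuming the published package names `okareo-ts-sdk` and `openai` (check each project's README if these have changed):

```shell
# Run from the root of your project.
npm install okareo-ts-sdk openai
```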
Next, set your API keys and Okareo project ID as environment variables so they can be used in `<your_flow_script>.ts` and `config.yml`.
Register your model with Okareo: In this case, we're using OpenAI's `gpt-3.5-turbo` model. For this example, we're asking the model to summarize text that is sent to it into a single sentence.
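As a sketch, registration might look like the following. The prompt templates and model name are our own illustrations, and the `register_model` / `OpenAIModel` shapes (including the `{scenario_input}` placeholder) are taken from the Okareo TypeScript SDK, so confirm the exact fields against the SDK reference:

```typescript
import { Okareo, OpenAIModel } from "okareo-ts-sdk";

const okareo = new Okareo({ api_key: process.env.OKAREO_API_KEY! });

// Illustrative prompts asking the model for a one-sentence summary.
// {scenario_input} is substituted with each scenario input at run time.
const SYSTEM_PROMPT = "You are a helpful assistant that summarizes documents.";
const USER_PROMPT =
  "Summarize the following text in exactly one sentence:\n\n{scenario_input}";

async function registerModel(projectId: string) {
  return okareo.register_model({
    name: "CI Text Summarizer",
    project_id: projectId,
    models: {
      type: "openai",
      model_id: "gpt-3.5-turbo",
      temperature: 0, // keep outputs as stable as possible for evaluation
      system_prompt_template: SYSTEM_PROMPT,
      user_prompt_template: USER_PROMPT,
    } as OpenAIModel,
    update: true, // re-registering with the same name updates the model
  });
}
```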
Create your scenario: Your scenario is a set of inputs and expected results. Scenarios can either be manually created (known as a seed scenario) or generated from previous scenarios. In this example, each input is a piece of text for the model to summarize, and each expected result is a summary that has already been manually verified as acceptable for that input.
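A seed scenario for the summarizer might be created like this. The sample document and summary are invented for illustration; `create_scenario_set` and the `SeedData` helper come from the Okareo TypeScript SDK:

```typescript
import { Okareo, SeedData } from "okareo-ts-sdk";

const okareo = new Okareo({ api_key: process.env.OKAREO_API_KEY! });

async function createScenario(projectId: string) {
  return okareo.create_scenario_set({
    name: "Summarization seed scenario",
    project_id: projectId,
    seed_data: [
      SeedData({
        // input: a document for the model to summarize
        input:
          "WebBizz is an online retailer that ships to 40 countries and " +
          "offers free returns within 30 days of purchase.",
        // result: a manually verified one-sentence summary of that input
        result:
          "WebBizz is an online retailer shipping to 40 countries with free 30-day returns.",
      }),
    ],
  });
}
```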
Tip: Alternatively, you can upload a scenario set from a file using `upload_scenario_set()`.
Run the LLM evaluation: Call Okareo's `run_test()` function, remembering to set the type to `NL_GENERATION`, as this is a natural language generation model you're testing (Okareo also supports testing other types of models). Also pass in any checks that you want to run at this stage.
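Putting that together, the call might look like this sketch, where `model` is the handle returned by `register_model` and `scenario` is the scenario set created above. The check names are Okareo's predefined summarization checks at the time of writing; verify them against the docs:

```typescript
import { TestRunType, RunTestProps } from "okareo-ts-sdk";

async function runEvaluation(model: any, scenario: any, projectId: string) {
  return model.run_test({
    name: "Summarizer CI evaluation",
    project_id: projectId,
    scenario: scenario,
    model_api_key: process.env.OPENAI_API_KEY, // Okareo calls OpenAI on your behalf
    calculate_metrics: true,
    type: TestRunType.NL_GENERATION, // natural language generation test
    checks: [
      "coherence_summary",
      "consistency_summary",
      "fluency_summary",
      "relevance_summary",
    ],
  } as RunTestProps);
}
```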
Set the thresholds for your checks: These will be needed for reporting purposes, to help determine whether your evaluation was a success. A common way to do this is to set a minimum threshold that each metric must meet (on average) for the evaluation to pass.
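Conceptually, a threshold is just a minimum average score per check. The helper below is our own illustration of that idea (not part of the Okareo SDK), with invented check names and values:

```typescript
// Minimum average score (out of 5) that each check must reach for the
// evaluation to count as a pass. Check names and values are illustrative.
const thresholds: Record<string, number> = {
  coherence_summary: 4.0,
  consistency_summary: 4.0,
  fluency_summary: 4.0,
  relevance_summary: 4.0,
};

// Return the names of any checks whose average score fell below its
// configured minimum; an empty array means the evaluation passed.
function failingChecks(
  averages: Record<string, number>,
  mins: Record<string, number>
): string[] {
  return Object.keys(mins).filter((check) => (averages[check] ?? 0) < mins[check]);
}
```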
Set up reporting: Okareo's GenerationReporter gives you statistics on how well your evaluation passed each check metric and whether it passed overall.
Using `reporter.log()` will log the details of your success or failure and print a link to a detailed online report, but you can also report whether the evaluation was a success or failure by checking whether `reporter.pass` is true.
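A sketch of the reporter wiring, where `evaluation` is the result returned by `run_test`; the constructor fields follow the Okareo TypeScript SDK at the time of writing, so confirm them against the SDK reference:

```typescript
import { GenerationReporter } from "okareo-ts-sdk";

declare const evaluation: any; // returned by run_test earlier in the flow

const reporter = new GenerationReporter({
  eval_run: evaluation,
  metrics_min: {
    // minimum average score each check must reach (illustrative values)
    coherence_summary: 4.0,
    consistency_summary: 4.0,
    fluency_summary: 4.0,
    relevance_summary: 4.0,
  },
  error_max: 0, // tolerate no failing checks
});

reporter.log(); // prints per-check stats and a link to the full online report

if (!reporter.pass) {
  console.error("Okareo evaluation did not meet its thresholds");
}
```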
Handle evaluation failures and other errors: There are two types of problem that you need to make sure your code can handle:
Okareo reports that an evaluation did not pass: As we will later be running this code in GitHub Actions, we can use the GitHub Actions core library to handle failures, which will work locally and in GitHub Actions.
TypeScript runtime errors: It's standard practice to handle runtime errors by wrapping your code in a try/catch block. You can continue to use the GitHub Actions core library to report the error.
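Both cases can be handled with one pattern. In the sketch below, `runFlow` is a hypothetical stand-in for the registration, scenario, and evaluation steps, and `@actions/core` is GitHub's official Actions toolkit package:

```typescript
import * as core from "@actions/core";

// Hypothetical stand-in: runs the whole Okareo flow and returns reporter.pass.
declare function runFlow(): Promise<boolean>;

async function main() {
  try {
    const passed = await runFlow();
    if (!passed) {
      // Marks the run as failed both locally (non-zero exit code) and in
      // GitHub Actions (failed step annotation).
      core.setFailed("Okareo evaluation failed: a check missed its threshold");
    }
  } catch (error) {
    // Runtime errors (bad API keys, network problems, SDK errors) land here.
    core.setFailed(`Okareo flow error: ${(error as Error).message}`);
  }
}

main();
```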
Run your Okareo flow script: On your local machine, run the command below.
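Run it from the project root (the directory containing `.okareo/`). The `-f` flag for selecting a single flow is how the CLI worked at the time of writing, and "summarization" stands in for your flow script's file name; check `okareo run --help` for the current flags:

```shell
# Run every flow under .okareo/flows, or a single named flow.
okareo run
okareo run -f summarization
```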
This will output reporting data to the CLI from the `reporter.log()` function, indicating whether the evaluation has passed or failed, and printing a link that takes you to a reporting page in Okareo where you can visualize your results more easily.



Integrating Okareo into your CI workflow
Now that you've got your Okareo flow running locally, you can add it to your CI workflow. To run Okareo in CI, you will need to install the Okareo CLI on your CI server, set up your API keys as CI environment variables, and add your okareo run command to your CI workflow.
Here, we show how to do this in GitHub Actions, but the principles can be easily generalized to any other CI provider, like CircleCI, Bitbucket Pipelines, or GitLab CI/CD.
Make sure you already have a GitHub repo with your Okareo project in it (the .okareo directory should be at the top level of the project).
Add secrets as environment variables: On your repository's main page, click Settings > Secrets and variables > Actions, and then New repository secret. Add secrets for OKAREO_API_KEY, OKAREO_PROJECT_ID and OPENAI_API_KEY.
Set up the workflow file: Inside your repo, click the Actions tab, then Set up a workflow yourself. This will create a GitHub Actions workflow config file at .github/workflows/main.yml.
Add the following code to your main.yml file. The Okareo action installs the Okareo CLI before Okareo tries to run your evaluation.
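A minimal workflow might look like the sketch below. The `okareo-ai/okareo-action` path, its version tag, and the triggering branches are assumptions; confirm the action's current name and inputs in Okareo's CI documentation:

```yaml
name: Okareo LLM evaluation

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    env:
      OKAREO_API_KEY: ${{ secrets.OKAREO_API_KEY }}
      OKAREO_PROJECT_ID: ${{ secrets.OKAREO_PROJECT_ID }}
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      # The Okareo action installs the Okareo CLI before the evaluation runs.
      - name: Install Okareo CLI
        uses: okareo-ai/okareo-action@v2
      - name: Run Okareo flow
        run: okareo run
```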
Now save your workflow file. This will trigger your GitHub Actions workflow to run (as will any push or pull request to your main branch).

Okareo automates the process of LLM evaluation
Start evaluating LLM applications in CI with Okareo and ensure that your LLM-powered functionality works as expected while also saving you countless hours of manual testing. Okareo is free to use for smaller projects (of up to 5k model data points, 1k evaluated rows and 50k scenario tokens).
You can get started with Okareo immediately by signing up and installing the software. The example evaluation in this tutorial can help you get started, as well as our cookbook project which contains many working examples.




