RAG Evaluation in CI


Matt Wyman
CEO/Co-Founder
Sarah Barber
Senior Technical Content Writer
December 6, 2024
If you've already built a RAG system, whether it's for your own enterprise or as part of a customer-facing SaaS product offered to multiple tenants, you need to implement regression testing to ensure that any changes, such as newly added data or new tenants added to the system, don't cause the RAG's responses to degrade. Even if you haven't changed anything, your system might still regress because of model drift.
Integrating RAG evaluation into your CI workflow allows you to progress from a proof-of-concept RAG to a mature and stable product that you can confidently maintain in a production environment.
This article focuses on evaluating the retrieval phase of a RAG system. To understand more about evaluating the LLM part of a RAG, check out our article on LLM evaluation in CI.
The case for RAG evaluation in CI
Adding RAG evaluation to your CI workflow will give you confidence that your RAG is suitable for use in a production environment. Automating a RAG evaluation to be triggered any time a change is made to your RAG means that if any part of your RAG regresses, you'll be able to track which change caused it.
This allows you to run tests against your RAG automatically whenever you've made a change to the RAG system, including:
Changes to the intent detection system, such as new or rephrased intents being added.
Changes to the retrieval system, such as adding or removing documents from a vector DB, fine-tuning or replacing the embedding model, or changing the embeddings themselves or the algorithms used to compare their similarity.
Changes to the generation system, such as fine-tuning or re-training the LLM, or changing the system prompts.
RAG evaluation isn't just a one-and-done thing. You need to constantly re-evaluate any time new data is added or the system has changed.
When evaluating your RAG, some of the main things you might want to check for include:
Was any new data correctly added to the vector DB? Check that new data was loaded and indexed correctly.
Has the output of each RAG phase regressed? Does the retrieval phase continue to return appropriate documents after new data is loaded? Have any errors been introduced?
Is there data leakage? For multi-tenant products, whenever new clients are added, you need to ensure that one customer’s data doesn't appear in another customer’s response.
Are speed and accuracy maintained as data grows? These should remain similar even when large amounts of new data are added. You may need to adjust your k value (the total number of returned embeddings) as your dataset grows in order to achieve this.
Examples of RAG evaluation that would require CI integration
Any RAG systems that are regularly updated (either with new documents being added to the vector DB, or with other changes to the system) would benefit from integrating RAG evaluation into a CI workflow. Two examples follow below.
A SaaS provider offering document search of internal engineering parts and products for different companies
Each tenant of this system has complex, large-scale databases with advanced AI search capabilities. One tenant is an engineering company called Acme Systems, and an employee of Acme Systems is looking for a specific type of sensor. They don't know its name or number but can describe it, so they enter a query into the RAG system: "Find a high-temperature sensor compatible with XYZ system, similar to what we ordered last month."
The RAG system filters results based on metadata tags or other rules. To start with, only documents owned by "Acme Systems" should be returned, not those of other tenant companies. It also needs to ensure that only documents classified as "high temperature sensors" are retrieved. Finally, the generative part of the system should return a natural language response like “The closest matches for a high-temperature sensor compatible with XYZ are Part #S12345 and Part #S23456. Both meet the specifications for high-temperature tolerance and match previous orders from last month.”
A variety of regression tests should be run whenever changes are made to this type of RAG system.
New products added: When new products are added to the vector DB, the retrieval part of the system should continue to return the most relevant documents, and the generation part should still be able to take a test query like "high-temperature sensor compatible with system XYZ" and return a response that's just as useful as before.
Changes to classification logic or metadata: If new fields are added to each product (for example, “temperature range” or “compatibility”), or new categories are added, the products should still be correctly categorized: for example, all "high temperature sensors" should be returned and no "low temperature sensors."
New tenant added: If a new tenant is added to the SaaS system, it's vital that their queries don't include other customers' results (or vice versa), as this would be confusing, not to mention a leak of proprietary data. If you have a separate vector database for each tenant, this evaluation needs to happen at the intent detection phase. However, if you have a single vector DB and filter by tenant using metadata (a much cheaper option), this becomes part of the retrieval phase.
Vector DB size increases: Over time, thousands of new products may be added, and you need to ensure that the retrieval phase continues to perform at a similar level, including accuracy, precision and recall. If these get worse, it may indicate that you need to change your k-value.
A single company's customer service question answering RAG
This company, which we'll call WebBizz, has an agent that accepts queries from customers on different topics. It has a bunch of internal policies stored as documents in a vector DB that can help inform its answers. For example, one policy might be that it will accept product returns for up to 30 days after an order was placed.
A customer, wanting to return an item, sends a query: "Can I return the blue T-shirt I recently bought?" The RAG has access to a customer database, so it's able to find out the date the customer ordered the blue T-shirt. It's able to detect that the intent is to return an item and find the most relevant policy documents relating to this.
The LLM is then able to use this policy document (along with its knowledge of the date of purchase) to determine whether a return is still permissible, and returns a natural language response such as "Yes, you can return your blue T-shirt. Please email returns@example.com with order reference number XXXXXXXX."
Whenever new policy documents are added, you should run an evaluation on the retrieval part of your RAG system, to ensure that no regressions have occurred. For example, a test query like "Can I return the blue T-shirt I recently bought?" should continue to return documents relating to returns and exchanges, with the most relevant documents being ranked at the top of the list.
How to evaluate the retrieval phase of a RAG using Okareo
This step-by-step example shows how to use Okareo to do RAG evaluation for WebBizz's customer service question answering RAG. Later, we'll cover how to add this to your CI workflow.
This example focuses on evaluating the retrieval phase of the RAG, which consists of an embedding model and a vector DB. We're using ChromaDB as the vector DB, along with its own built-in embedding model. The Okareo flow scripts are written in Python (although TypeScript is also an option).
Step 1: Install Okareo
Start by signing up for an Okareo account and installing the Okareo CLI on your local machine. This involves setting some environment variables: API keys for Okareo and OpenAI (we'll be using one of their models), and your Okareo project ID, which you can find in the URL when you're logged into the app.
Step 2: Load the WebBizz policy documents into the vector DB
In .okareo/flows, create a file called retrieval-eval.py, and begin by loading the WebBizz policy documents into ChromaDB. The full code for this example is provided on our GitHub. Create metadata categories for each document, such as "Support" and "Return and exchange."
Step 3: Upload a set of test scenarios to Okareo
A scenario set is a series of test input questions, each paired with a list of relevant results (ordered by the most relevant) that the retrieval system should return. These scenarios need to be written by a human (in this example they were created by us at Okareo).
Step 4: Register your embedding model and vector DB with Okareo
The following code defines and registers a custom embedding model with Okareo.
When Okareo calls the custom model's invoke endpoint, it retrieves the top five most relevant results from ChromaDB (controlled by n_results=5, though you can adjust this value based on evaluation results). The query_results_to_score function formats the results into the JSON structure required by Okareo.
Step 5: Choose your evaluation metrics
Okareo offers a number of retrieval metrics for RAG evaluation, including:
Precision: Looks at the proportion of relevant items among the top k results. If you look at the top five results and four of them are relevant, Precision@5 would be 0.8.
Mean Reciprocal Rank (MRR): Looks at the rank at which the first relevant item appears and calculates its reciprocal. For example, if the first relevant item is ranked third, the reciprocal rank is 1/3.
Normalized Discounted Cumulative Gain (NDCG): Gives higher scores to relevant documents in higher-ranked positions and diminishing scores for lower-ranked results. A high NDCG score shows good performance across varying levels of relevance.
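To make the first two definitions concrete, here is a small hand-rolled sketch (the document IDs are made up; `retrieved` is ranked best-first and `relevant` is the set of documents a human marked as correct):

```python
# Hand-rolled versions of precision@k and reciprocal rank,
# matching the definitions above.

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result (0.0 if none appear)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

retrieved = ["d7", "d2", "d9", "d4", "d1"]   # ranked best-first
relevant = {"d2", "d4", "d9", "d5"}

print(precision_at_k(retrieved, relevant, 5))  # 0.6 — three of the top five
print(reciprocal_rank(retrieved, relevant))    # 0.5 — first hit at rank 2
```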
You also need to decide which k intervals you want to evaluate for. In the example below, we've chosen k intervals from 1 to 5, but for a larger document search you may want to go with larger intervals up to k=10.
Step 6: Add code for running your retrieval evaluation
Step 7: Run your entire Okareo flow script
From your project directory, run your flow script in .okareo/flows with the okareo run command.
Step 8: View your RAG evaluation results
You can view the results of your RAG evaluation by clicking the link that gets printed when your flow script is run, or in the "Evaluations" tab in the Okareo app.

For more information on these results and how to interpret them, see Interpret your evaluation results in our RAG evaluation article.
How to integrate your RAG evaluation into CI
To run an Okareo flow script in a CI environment, you'll need to install Okareo on your CI server, add any API keys as CI environment variables, and add the okareo run command to your CI workflow.
Follow this step-by-step example to run your RAG evaluation in GitHub Actions. This can easily be adapted to work with other CI providers.
Start by creating a GitHub repo for your project. Make sure the .okareo folder is at the root of the repo, with all your Okareo files inside it.
Next, add your environment variables as GitHub Actions secrets by going to Settings → Secrets and variables → Actions → New repository secret. Add secrets for OKAREO_API_KEY, OKAREO_PROJECT_ID and OPENAI_API_KEY.
Create a GitHub Actions workflow file by going to Actions → Set up a workflow yourself, and add the following code to the newly created main.yml file.
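A minimal main.yml along these lines might look like the following sketch; the action versions and Python version are assumptions, and the CLI install step is a placeholder for the command given in Okareo's install instructions:

```yaml
name: RAG evaluation

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  rag-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # Install the Okareo CLI here, using the command from Okareo's
      # install instructions for your platform and CLI version.
      - name: Run RAG evaluation
        run: okareo run
        env:
          OKAREO_API_KEY: ${{ secrets.OKAREO_API_KEY }}
          OKAREO_PROJECT_ID: ${{ secrets.OKAREO_PROJECT_ID }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```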
Now commit your changes, which will trigger the action. The RAG evaluation runs, and you can click the link to view the results in the Okareo app. From now on, any push or pull request to the main branch will trigger a new RAG evaluation.

Comparing the results of two CI RAG evaluations
Trigger your CI script to run twice. These two evaluation runs can now be compared in the Okareo app.
Navigate to Score Cards, select your model "vectordb_retrieval_test" and then choose the most recent evaluation run. In the next card, select the same model again, but choose the previous evaluation run. Now both can be compared side by side. You can get an overview of each metric for different values of k by using the slider. For more detailed reporting on each evaluation, click the name of the evaluation run ("Retrieval Test Run…").

Continuously improve your RAG through RAG evaluation in CI
Incorporating RAG evaluation into your CI workflow provides ongoing assurance of the reliability of your RAG system, but it also lays the groundwork for continuous improvement. By regularly comparing evaluation runs, you can proactively make data-driven adjustments to the retrieval part of your RAG system, allowing your RAG to grow and show resilience in a production environment.
Okareo provides a framework for evaluating your RAG that can be easily integrated into CI — and it's well-placed to support you if you decide to move towards continuous deployment. Sign up to Okareo today.




