RAG Evaluation in CI
Evaluation
Matt Wyman, CEO/Co-Founder
Sarah Barber, Senior Technical Content Writer
December 6, 2024
If you've already built a RAG system, whether it's for your own enterprise or as part of a customer-facing SaaS product offered to multiple tenants, you need to implement regression testing to ensure that any changes, such as newly added data or new tenants added to the system, don't cause the RAG's responses to degrade. Even if you haven't changed anything, your system might still regress because of model drift.
Integrating RAG evaluation into your CI workflow allows you to progress from a proof-of-concept RAG to a mature and stable product that you can confidently maintain in a production environment.
This article focuses on evaluating the retrieval phase of a RAG system. To understand more about evaluating the LLM part of a RAG, check out our article on LLM evaluation in CI.
The case for RAG evaluation in CI
Adding RAG evaluation to your CI workflow will give you confidence that your RAG is suitable for use in a production environment. Automating a RAG evaluation to be triggered any time a change is made to your RAG means that if any part of your RAG regresses, you'll be able to track which change caused it.
This means you can automatically run tests against your RAG whenever any part of the system changes, including:
Changes to the intent detection system, such as new or rephrased intents being added.
Changes to the retrieval system, such as adding or removing documents from a vector DB, fine-tuning or replacing the embedding model, or changing the embeddings themselves or the algorithms used to compare their similarity.
Changes to the generation system, such as fine-tuning or re-training the LLM, or changing the system prompts.
RAG evaluation isn't a one-and-done exercise: you need to re-evaluate any time new data is added or the system changes.
When evaluating your RAG, some of the main things you might want to check for include:
Was any new data correctly added to the vector DB? Check that new data was loaded and indexed correctly.
Has the output of any RAG phase regressed? Does the retrieval phase continue to return appropriate documents after new data is loaded? Have any errors been introduced?
Is there data leakage? For multi-tenant products, whenever new clients are added, you need to ensure that one customer’s data doesn't appear in another customer’s response.
Are speed and accuracy maintained as data grows? Both should remain roughly stable even when large amounts of new data are added. You may need to adjust your k value (the number of embeddings returned per query) as your dataset grows to achieve this.
Examples of RAG evaluation that would require CI integration
Any RAG system that is regularly updated (whether with new documents being added to the vector DB or with other changes to the system) would benefit from integrating RAG evaluation into a CI workflow. Two examples follow.
A SaaS provider offering document search of internal engineering parts and products for different companies
Each tenant of this system has complex, large-scale databases with advanced AI search capabilities. One tenant is an engineering company called Acme Systems, and an employee of Acme Systems is looking for a specific type of sensor. They don't know its name or number but can describe it, so they enter a query into the RAG system: "Find a high-temperature sensor compatible with XYZ system, similar to what we ordered last month."
The RAG system filters results based on metadata tags or other rules. To start with, only documents owned by "Acme Systems" should be returned, not those of other tenant companies. It also needs to ensure that only documents classified as "high temperature sensors" are retrieved. Finally, the generative part of the system should return a natural language response like "The closest matches for a high-temperature sensor compatible with XYZ are Part #S12345 and Part #S23456. Both meet the specifications for high-temperature tolerance and match previous orders from last month."
A variety of regression tests should be run whenever changes are made to this type of RAG system.
New products added: When new products are added to the vector DB, the retrieval part of the system should continue to return the most relevant documents, and the generation part should still be able to take a test query like "high-temperature sensor compatible with system XYZ" and return a response that's just as useful as before.
Changes to classification logic or metadata: If new fields are added to each product (for example, "temperature range" or "compatibility"), or new categories are added, the products should still be correctly categorized: for example, all "high temperature sensors" should be returned and no "low temperature sensors."
New tenant added: If a new tenant is added to the SaaS system, it's vital that their queries don't include other customers' results (or vice versa), as this would be confusing, not to mention a leak of proprietary data. If you have separate vector databases for each tenant, this check needs to happen at the intent detection phase. However, if you have only one vector DB and filter each tenant using metadata (a much cheaper option), it becomes part of the retrieval phase (see the sketch after this list).
Vector DB size increases: Over time, thousands of new products may be added, and you need to ensure that the retrieval phase continues to perform at a similar level, including accuracy, precision, and recall. If these get worse, it may indicate that you need to change your k value.
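Here's a minimal sketch of that metadata-based tenant filtering, using ChromaDB (the vector DB we use later in this article). The "tenant" and "category" field names are illustrative assumptions, not a prescribed schema:
# Minimal sketch of tenant isolation via metadata filtering in ChromaDB
# The "tenant" and "category" fields are illustrative, not part of any specific product
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="parts_catalog", metadata={"hnsw:space": "cosine"})

# Tag every document with the tenant that owns it when it's added
collection.add(
    documents=["High-temperature sensor, 300C rated, compatible with XYZ systems"],
    ids=["part-S12345"],
    metadatas=[{"tenant": "acme-systems", "category": "high_temperature_sensor"}],
)

# At query time, restrict retrieval to the querying tenant's documents
results = collection.query(
    query_texts=["high-temperature sensor compatible with XYZ"],
    n_results=5,
    where={"tenant": "acme-systems"},
)
A data-leakage regression test would then assert that every ID returned for a given tenant's query belongs to that tenant.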
A single company's customer service question answering RAG
This company, which we'll call WebBizz, has an agent that accepts queries from customers on different topics. It has a set of internal policies stored as documents in a vector DB that help inform its answers. For example, one policy might be that product returns are accepted for up to 30 days after an order was placed.
A customer who wants to return an item sends a query: "Can I return the blue T-shirt I recently bought?" The RAG has access to a customer database, so it can look up the date the customer ordered the blue T-shirt. It detects that the intent is to return an item and finds the most relevant policy documents relating to this.
The LLM is then able to use this policy document (along with its knowledge of the date of purchase) to determine whether a return is still permissible, and returns a natural language response such as "Yes, you can return your blue T-shirt. Please email returns@example.com with order reference number XXXXXXXX."
Whenever new policy documents are added, you should run an evaluation on the retrieval part of your RAG system, to ensure that no regressions have occurred. For example, a test query like "Can I return the blue T-shirt I recently bought?" should continue to return documents relating to returns and exchanges, with the most relevant documents being ranked at the top of the list.
How to evaluate the retrieval phase of a RAG using Okareo
This step-by-step example shows how to use Okareo to do RAG evaluation for WebBizz's customer service question answering RAG. Later, we'll cover how to add this to your CI workflow.
This example focuses on evaluating the retrieval phase of the RAG, which consists of an embedding model and a vector DB. We're using ChromaDB as the vector DB, along with its own built-in embedding model. The Okareo flow scripts are written in Python (although TypeScript is also an option).
Step 1: Install Okareo
Start by signing up for an Okareo account and installing the Okareo CLI on your local machine. This involves setting some environment variables, including API keys for Okareo and OpenAI (we'll be using one of their models) and your Okareo project ID, which can be found in the URL when you're logged into the app.
export OKAREO_API_KEY=<YOUR_OKAREO_API_KEY>
export OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>
export OKAREO_PROJECT_ID=<YOUR_OKAREO_PROJECT_ID>
export OKAREO_PATH="<YOUR_OKAREO_PATH>"
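If you're following along in Python, you'll also need the packages that the flow script below imports. Assuming the standard PyPI package names for each (okareo, chromadb, and pandas), the install looks like this:
pip install okareo chromadb pandas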
Step 2: Load the WebBizz policy documents into the vector DB
In .okareo/flows, create a file called retrieval-eval.py and begin by loading the WebBizz policy documents into ChromaDB. The full code for this example is provided on our GitHub. Create metadata categories for each document, such as "Support" and "Return and exchange."
### Load documents and create corresponding metadata ###
# Import the necessary libraries
import os
from io import StringIO
import pandas as pd
# Load documents from Okareo's GitHub repository
webbizz_articles = os.popen('curl https://raw.githubusercontent.com/okareo-ai/okareo-python-sdk/main/examples/webbizz_10_articles.jsonl').read()
# Convert the JSONL string to a pandas DataFrame
jsonObj = pd.read_json(path_or_buf=StringIO(webbizz_articles), lines=True)
# Create rough categories for each document based on the content
# Store the categories in metadata_list
metadata_list = []
input_list = list(jsonObj.input)
for i in range(len(input_list)):
    if "sustainability" in input_list[i] or "security" in input_list[i]:
        metadata_list.append({"article_type": "Safety and sustainability"})
    elif "support" in input_list[i] or "help" in input_list[i]:
        metadata_list.append({"article_type": "Support"})
    elif "return" in input_list[i] or "exchange" in input_list[i]:
        metadata_list.append({"article_type": "Return and exchange"})
    else:
        metadata_list.append({"article_type": "Miscellaneous"})
### Create ChromaDB instance and add documents and metadata to it ###
# Import ChromaDB
import chromadb
# Create a ChromaDB client
chroma_client = chromadb.Client()
# Create a ChromaDB collection
# The collection will be used to store the documents as vector embeddings
# We want to measure the similarity between questions and documents using cosine similarity
collection = chroma_client.create_collection(name="retrieval_test", metadata={"hnsw:space": "cosine"})
# Add the documents to the collection with the corresponding metadata
# (the built-in embedding model converts the documents to vector embeddings).
# Each document ID comes from the corresponding "result" in the scenario.
collection.add(
    documents=list(jsonObj.input),
    ids=list(jsonObj.result),
    metadatas=metadata_list
)
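Before wiring this up to Okareo, you can optionally sanity-check the collection with a quick query. This isn't part of the Okareo flow; it's just a guard against loading or indexing problems:
# Optional sanity check: confirm the documents were indexed and that a known query
# retrieves something sensible before running the evaluation
print(f"Documents in collection: {collection.count()}")  # should be 10 for this dataset

sample = collection.query(query_texts=["How do I return an item?"], n_results=3)
for doc_id, distance in zip(sample["ids"][0], sample["distances"][0]):
    print(doc_id, round(1 - distance / 2, 3))  # cosine distance converted to a 0-1 similarity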
Step 3: Upload a set of test scenarios to Okareo
A scenario set is a series of test input questions, each paired with a list of the relevant results (ordered by relevance) that the retrieval system should return. These scenarios need to be written by a human (in this example, they were created by us at Okareo).
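To give a rough idea of the shape of this data, each line of the scenario JSONL pairs an input question with the IDs of the documents that should be retrieved. The line below is illustrative only; the placeholder isn't a real ID from the file:
{"input": "Can I return the blue T-shirt I recently bought?", "result": ["<ID of the returns and exchanges article>"]}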
### Create a scenario set ###
# Import libraries
import tempfile
from okareo import Okareo
from okareo_api_client.models import TestRunType
from okareo.model_under_test import CustomModel, ModelInvocation
# Create an instance of the Okareo client
OKAREO_API_KEY = os.environ.get("OKAREO_API_KEY")
if not OKAREO_API_KEY:
    raise ValueError("OKAREO_API_KEY environment variable is not set")
okareo = Okareo(OKAREO_API_KEY)
# Download questions from Okareo's GitHub repository
webbizz_retrieval_questions = os.popen('curl https://raw.githubusercontent.com/okareo-ai/okareo-python-sdk/main/examples/webbizz_retrieval_questions.jsonl').read()
# Save the questions to a temporary file
temp_dir = tempfile.gettempdir()
file_path = os.path.join(temp_dir, "webbizz_retrieval_questions.jsonl")
with open(file_path, "w+") as file:
    file.write(webbizz_retrieval_questions)
# Upload the questions to Okareo from the temporary file
scenario = okareo.upload_scenario_set(file_path=file_path, scenario_name="Retrieval Articles Scenario")
# Clean up the temporary file
os.remove(file_path)
Step 4: Register your embedding model and vector DB with Okareo
The following code defines and registers a custom embedding model with Okareo. When Okareo calls the custom model's invoke method, it retrieves the top five most relevant results from ChromaDB (controlled by n_results=5, though you can adjust this value based on evaluation results). The query_results_to_score function formats the results into the JSON structure required by Okareo.
### Create custom embedding model and register it ###
# A function to convert the query results from our ChromaDB collection into a list of dictionaries with the document ID, score, metadata, and label
def query_results_to_score(results):
    parsed_ids_with_scores = []
    for i in range(0, len(results['distances'][0])):
        # Convert the cosine distance (0 = identical, 2 = opposite) into a similarity score between 1 and 0
        score = (2 - results['distances'][0][i]) / 2
        parsed_ids_with_scores.append(
            {
                "id": results['ids'][0][i],
                "score": score,
                "metadata": results['metadatas'][0][i],
                "label": f"{results['metadatas'][0][i]['article_type']} WebBizz Article w/ ID: {results['ids'][0][i]}"
            }
        )
    return parsed_ids_with_scores
# Define a custom retrieval model that uses the ChromaDB collection to retrieve documents
# The model will return the top five most relevant documents based on the input query
class CustomEmbeddingModel(CustomModel):
    def invoke(self, input: str) -> ModelInvocation:
        # Query the collection with the input text
        results = collection.query(
            query_texts=[input],
            n_results=5
        )
        # Return formatted query results and the model response context
        return ModelInvocation(model_prediction=query_results_to_score(results), model_output_metadata={'model_data': input})
# Register the model with Okareo
# This will return a model if it already exists or create a new one if it doesn't
model_under_test = okareo.register_model(name="vectordb_retrieval_test", model=CustomEmbeddingModel(name="custom retrieval"))
Step 5: Choose your evaluation metrics
Okareo offers a number of retrieval metrics for RAG evaluation to choose between, including the following (a short worked example follows this list):
Precision: Looks at the proportion of relevant items among the top k results. If you look at the top five results and four of them are relevant, Precision@5 would be 0.8.
Mean Reciprocal Rank (MRR): Looks at the rank at which the first relevant item appears and calculates its reciprocal. For example, if the first relevant item is ranked third, the reciprocal rank is 1/3.
Normalized Discounted Cumulative Gain (NDCG): Gives higher scores to relevant documents in higher-ranked positions and diminishing scores for lower-ranked results. A high NDCG score shows good performance across varying levels of relevance.
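To make the first two of these concrete, here's the small worked example promised above, using only the definitions just given:
# Toy example: for one query, the retriever returned these IDs in ranked order,
# and only doc_2 and doc_5 are actually relevant
retrieved = ["doc_1", "doc_2", "doc_3", "doc_4", "doc_5"]
relevant = {"doc_2", "doc_5"}

k = 5
precision_at_k = sum(1 for d in retrieved[:k] if d in relevant) / k  # 2 relevant in the top 5 -> 0.4

# Reciprocal rank: 1 / rank of the first relevant result (doc_2 is ranked 2nd -> 0.5)
# MRR is the mean of this value across all queries in the scenario set
reciprocal_rank = next((1 / (i + 1) for i, d in enumerate(retrieved) if d in relevant), 0.0)

print(precision_at_k, reciprocal_rank)  # 0.4 0.5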
You also need to decide which k intervals you want to evaluate for. In the example below, we've chosen k intervals from 1 to 5, but for a larger document search you may want to go with larger intervals up to k=10.
# Define the k intervals for the evaluation metrics
at_k_intervals = [1, 2, 3, 4, 5]
# Choose your retrieval evaluation metrics
metrics_kwargs = {
    "accuracy_at_k": at_k_intervals,
    "precision_recall_at_k": at_k_intervals,
    "ndcg_at_k": at_k_intervals,
    "mrr_at_k": at_k_intervals,
    "map_at_k": at_k_intervals,
}
Step 6: Add code for running your retrieval evaluation
# Import the datetime module for timestamping
from datetime import datetime
# Perform a test run using the uploaded scenario set
test_run_item = model_under_test.run_test(
    scenario=scenario, # Use the scenario from the scenario set uploaded earlier
    name=f"Retrieval Test Run - {datetime.now().strftime('%m-%d %H:%M:%S')}", # Add a timestamp to the test run name
    test_run_type=TestRunType.INFORMATION_RETRIEVAL, # Specify that we are running an information retrieval test
    calculate_metrics=True,
    # Define the evaluation metrics to calculate
    metrics_kwargs=metrics_kwargs
)
# Generate a link back to Okareo for evaluation visualization
app_link = test_run_item.app_link
print(f"See results in Okareo: {app_link}")
Step 7: Run your entire Okareo flow script
From your project directory, run your flow script in .okareo/flows with the okareo run command.
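Assuming your flow file is named retrieval-eval.py as in Step 2 (the -f argument takes the flow name without the .py extension), the command looks like this:
okareo run -f retrieval-eval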
Step 8: View your RAG evaluation results
You can view the results of your RAG evaluation by clicking the link that gets printed when your flow script is run, or in the "Evaluations" tab in the Okareo app.
For more information on these results and how to interpret them, see "Interpret your evaluation results" in our RAG evaluation article.
How to integrate your RAG evaluation into CI
To run an Okareo flow script in a CI environment, you'll need to install Okareo on your CI server, add any API keys as CI environment variables, and add the okareo run command to your CI workflow.
Follow this step-by-step example to run your RAG evaluation in GitHub Actions. This can easily be adapted to work with other CI providers.
Start by creating a GitHub repo for your project. Make sure the .okareo folder sits at the root of the repo, with all your Okareo files inside it.
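For this example, the layout should look something like this (only the files created above are shown):
your-repo/
└── .okareo/
    └── flows/
        └── retrieval-eval.py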
Next, add your environment variables as GitHub Actions secrets by going to Settings → Secrets and variables → Actions → New repository secret. Add secrets for OKAREO_API_KEY, OKAREO_PROJECT_ID, and OPENAI_API_KEY.
Create a GitHub Actions workflow file by going to Actions → Set up a workflow yourself, and add the following code to the newly created main.yml file.
name: RAG evaluation Okareo flow
env:
  DEMO_BUILD_ID: ${{ github.run_number }}
  OKAREO_API_KEY: ${{ secrets.OKAREO_API_KEY }}
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  OKAREO_PROJECT_ID: ${{ secrets.OKAREO_PROJECT_ID }}
on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]
jobs:
  rag-eval:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: .
    permissions:
      contents: 'read'
      id-token: 'write'
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Okareo Action
        uses: okareo-ai/okareo-action@v2.5
      - name: RAG Evaluation
        run: |
          okareo -v
          okareo run -d -f retrieval-eval
Now commit your changes, which will trigger the action. The RAG evaluation runs, and you can click the link in the job output to view the results in the Okareo app. From now on, any push or pull request to the main branch will trigger a new RAG evaluation.
Comparing the results of two CI RAG evaluations
Trigger your CI script to run twice. These two evaluation runs can now be compared in the Okareo app.
Navigate to Score Cards, select your model "vectordb_retrieval_test", and then choose the most recent evaluation run. In the next card, select the same model again, but choose the previous evaluation run. Now both can be compared side by side. You can get an overview of each metric for different values of k by using the slider. For more detailed reporting on each evaluation, click the name of the evaluation run ("Retrieval Test Run…").
Continuously improve your RAG through RAG evaluation in CI
Incorporating RAG evaluation into your CI workflow provides ongoing assurance of the reliability of your RAG system, and it also lays the groundwork for continuous improvement. By regularly comparing evaluation runs, you can proactively make data-driven adjustments to the retrieval part of your RAG system, allowing your RAG to grow and remain resilient in a production environment.
Okareo provides a framework for evaluating your RAG that can be easily integrated into CI — and it's well-placed to support you if you decide to move towards continuous deployment. Sign up to Okareo today.
If you've already built a RAG system, whether it's for your own enterprise or as part of a customer-facing SaaS product offered to multiple tenants, you need to implement regression testing to ensure that any changes, such as newly added data or new tenants added to the system, don't cause the RAG's responses to degrade. Even if you haven't changed anything, your system might still regress because of model drift.
Integrating RAG evaluation into your CI workflow allows you to progress from a proof-of-concept RAG to a mature and stable product that you can confidently maintain in a production environment.
This article focuses on evaluating the retrieval phase of a RAG system. To understand more about evaluating the LLM part of a RAG, check out our article on LLM evaluation in CI.
The case for RAG evaluation in CI
Adding RAG evaluation to your CI workflow will give you confidence that your RAG is suitable for use in a production environment. Automating a RAG evaluation to be triggered any time a change is made to your RAG means that if any part of your RAG regresses, you'll be able to track which change caused it.
This allows you to run tests against your RAG automatically whenever you've made a change to the RAG system, including:
Changes to the intent detection system, such as new or rephrased intents being added.
Changes to the retrieval system, such as adding or removing documents from a vector DB, fine-tuning or replacing the embedding model, or changing the embeddings themselves or the algorithms used to compare their similarity.
Changes to the generation system, such as fine-tuning or re-training the LLM, or changing the system prompts.
RAG evaluation isn't just a one-and-done thing. You need to constantly re-evaluate any time new data is added or the system has changed.
When evaluating your RAG, some of the main things you might want to check for include:
Was any new data correctly added to the vector DB? You can check if any new data was loaded or indexed correctly.
Has the output of each RAG phase regressed? Or does the retrieval phase continue to return appropriate documents after new data is loaded? Have any errors been introduced?
Is there data leakage? For multi-tenant products, whenever new clients are added, you need to ensure that one customer’s data doesn't appear in another customer’s response.
Is speed and accuracy maintained as data grows? These should remain similar even when large amounts of new data is added. You may need to adjust your k value (the total number of returned embeddings) as your dataset grows in order to achieve this.
Examples of RAG evaluation that would require CI integration
Any RAG systems that are regularly updated (either with new documents being added to the vector DB, or with other changes to the system) would benefit from integrating RAG evaluation into a CI workflow. Two examples follow below.
A SaaS provider offering document search of internal engineering parts and products for different companies
Each tenant of this system has complex, large-scale databases with advanced AI search capabilities. One tenant is an engineering company called Acme Systems, and an employee of Acme Systems is looking for a specific type of sensor. They don't know its name or number but can describe it, so they enter a query into the RAG system: "Find a high-temperature sensor compatible with XYZ system, similar to what we ordered last month."
The RAG system filters out results based on metadata tags or other rules. To start with, only documents that are owned by "Acme Systems" should be returned, and not those of other tenant companies. But it also needs to ensure that only documents classified as "high temperature sensors" are retrieved. Finally the generative part of the system should return a natural language response like “The closest matches for a high-temperature sensor compatible with XYZ are Part #S12345 and Part #S23456. Both meet the specifications for high-temperature tolerance and match previous orders from last month.”
A variety of regression tests should be run whenever changes are made to this type of RAG system.
New products added: When new products are added to the vector DB, the retrieval part of the system should continue to return the most relevant documents, and the generation part should still be able to take a test query like "high-temperature sensor compatible with system XYZ" and return a response that's just as useful as before.
Changes to classification logic or metadata: If new fields are added to each product (for example, “temperature range” or “compatibility"), or new categories are added, the products should still be correctly categorized: for example, all "high temperature sensors" should be returned and no "low temperature sensors."
New tenant added: If a new tenant is added to the SaaS system, it's vital that their queries must not include other customers' results (or vice versa), as this would be confusing, not to mention leaking proprietary data. If you have separate vector databases for each tenant, then this evaluation needs to happen at the intent detection phase. However if you have only one vector DB and filter each tenant using metadata (a much cheaper option), then this becomes part of the retrieval phase.
Vector DB size increases: Over time, thousands of new products may be added, and you need to ensure that the retrieval phase continues to perform at a similar level, including accuracy, precision and recall. If these get worse, it may indicate that you need to change your k-value.
A single company's customer service question answering RAG
This company, which we'll call WebBizz, has an agent that accepts queries from customers on different topics. It has a bunch of internal policies stored as documents in a vector DB that can help inform its answers. For example, one policy might be that it will accept product returns for up to 30 days after an order was placed.
A customer, wanting to return an item, sends a query: "Can I return the blue T-shirt I recently bought?" The RAG has access to a customer database, so it's able to find out the date the customer ordered the blue T-shirt. It's able to detect that the intent is to return an item and find the most relevant policy documents relating to this.
The LLM is then able to use this policy document (along with its knowledge of the date of purchase) to determine whether a return is still permissible, and returns a natural language response such as "Yes, you can return your blue T-shirt. Please email returns@example.com with order reference number XXXXXXXX."
Whenever new policy documents are added, you should run an evaluation on the retrieval part of your RAG system, to ensure that no regressions have occurred. For example, a test query like "Can I return the blue T-shirt I recently bought?" should continue to return documents relating to returns and exchanges, with the most relevant documents being ranked at the top of the list.
How to evaluate the retrieval phase of a RAG using Okareo
This step-by-step example shows how to use Okareo to do RAG evaluation for WebBizz's customer service question answering RAG. Later, we'll cover how to add this to your CI workflow.
This example focuses on evaluating the retrieval phase of the RAG, which consists of an embedding model and a vector DB. We're using ChromaDB as the vector DB, along with its own built-in embedding model. The Okareo flow scripts are written in Python (although TypeScript is also an option).
Step 1: Install Okareo
Start by signing up for an Okareo account and installing the Okareo CLI on your local machine. This includes some environment variables, including API keys for Okareo and OpenAI (we'll be using one of their models), and your Okareo project ID, which can be found in the URL when you're logged into the app.
export OKAREO_API_KEY=<YOUR_OKAREO_API_KEY>
export OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>
export OKAREO_PROJECT_ID=<YOUR_OKAREO_PROJECT_ID>
export OKAREO_PATH="<YOUR_OKAREO_PATH>
Step 2: Load the WebBizz policy documents into the vector DB
In .okareo/flows
, create a file called retrieval-eval.py
, and begin by loading the WebBizz policy documents into ChromaDB. The full code for this example is provided on our GitHub. Create metadata categories for each document, such as "Support" and "Return and exchange."
### Load documents and create corresponding metadata ###
# Import the necessary libraries
import os
from io import StringIO
import pandas as pd
# Load documents from Okareo's GitHub repository
webbizz_articles = os.popen('curl https://raw.githubusercontent.com/okareo-ai/okareo-python-sdk/main/examples/webbizz_10_articles.jsonl').read()
# Convert the JSONL string to a pandas DataFrame
jsonObj = pd.read_json(path_or_buf=StringIO(webbizz_articles), lines=True)
# Create rough categories for each document based on the content
# Store the categories in metadata_list
metadata_list = []
input_list = list(jsonObj.input)
for i in range(len(input_list)):
if "sustainability" in input_list[i] or "security" in input_list[i]:
metadata_list.append({"article_type": "Safety and sustainability"})
elif "support" in input_list[i] or "help" in input_list[i]:
metadata_list.append({"article_type": "Support"})
elif "return" in input_list[i] or "exchange" in input_list[i]:
metadata_list.append({"article_type": "Return and exchange"})
else:
metadata_list.append({"article_type": "Miscellaneous"})
### Create ChromaDB instance and add documents and metadata to it ###
# Import ChromaDB
import chromadb
# Create a ChromaDB client
chroma_client = chromadb.Client()
# Create a ChromaDB collection
# The collection will be used to store the documents as vector embeddings
# We want to measure the similarity between questions and documents using cosine similarity
collection = chroma_client.create_collection(name="retrieval_test", metadata={"hnsw:space": "cosine"})
# Add the documents to the collection with the corresponding metadata (the in-built embedding model converts the documents to vector embeddings). Each document ID comes from the corresponding "result" in the scenario.
collection.add(
documents=list(jsonObj.input),
ids=list(jsonObj.result),
metadatas=metadata_list
)
Step 3: Upload a set of test scenarios to Okareo
A scenario set is a series of test input questions, each paired with a list of relevant results (ordered by the most relevant) that the retrieval system should return. These scenarios need to be written by a human (in this example they were created by us at Okareo).
### Create a scenario set ###
# Import libraries
import tempfile
from okareo import Okareo
from okareo_api_client.models import TestRunType
from okareo.model_under_test import CustomModel, ModelInvocation
# Create an instance of the Okareo client
OKAREO_API_KEY = os.environ.get("OKAREO_API_KEY")
if not OKAREO_API_KEY:
raise ValueError("OKAREO_API_KEY environment variable is not set")
okareo = Okareo(OKAREO_API_KEY)
# Download questions from Okareo's GitHub repository
webbizz_retrieval_questions = os.popen('curl https://raw.githubusercontent.com/okareo-ai/okareo-python-sdk/main/examples/webbizz_retrieval_questions.jsonl').read()
# Save the questions to a temporary file
temp_dir = tempfile.gettempdir()
file_path = os.path.join(temp_dir, "webbizz_retrieval_questions.jsonl")
with open(file_path, "w+") as file:
file.write(webbizz_retrieval_questions)
# Upload the questions to Okareo from the temporary file
scenario = okareo.upload_scenario_set(file_path=file_path, scenario_name="Retrieval Articles Scenario")
# Clean up the temporary file
os.remove(file_path)
Step 4. Register your embedding model and vector DB with Okareo
The following code defines and registers a custom embedding model with Okareo.
When Okareo calls the custom model's invoke endpoint, it retrieves the top five most relevant results from ChromaDB (controlled by n_results=5
, though you can adjust this value based on evaluation results). The query_results_to_score
function formats the results into the JSON structure required by Okareo.
### Create custom embedding model and register it ###
# A function to convert the query results from our ChromaDB collection into a list of dictionaries with the document ID, score, metadata, and label
def query_results_to_score(results):
parsed_ids_with_scores = []
for i in range(0, len(results['distances'][0])):
# Create a score based on cosine similarity
score = (2 - results['distances'][0][i]) / 2
parsed_ids_with_scores.append(
{
"id": results['ids'][0][i],
"score": score,
"metadata": results['metadatas'][0][i],
"label": f"{results['metadatas'][0][i]['article_type']} WebBizz Article w/ ID: {results['ids'][0][i]}"
}
)
return parsed_ids_with_scores
# Define a custom retrieval model that uses the ChromaDB collection to retrieve documents
# The model will return the top five most relevant documents based on the input query
class CustomEmbeddingModel(CustomModel):
def invoke(self, input: str) -> ModelInvocation:
# Query the collection with the input text
results = collection.query(
query_texts=[input],
n_results=5
)
# Return formatted query results and the model response context
return ModelInvocation(model_prediction=query_results_to_score(results), model_output_metadata={'model_data': input})
# Register the model with Okareo
# This will return a model if it already exists or create a new one if it doesn't
model_under_test = okareo.register_model(name="vectordb_retrieval_test", model=CustomEmbeddingModel(name="custom retrieval"))
Step 5. Choose your evaluation metrics
Okareo offers a number of different retrieval metrics for RAG evaluation for you to choose between, including:
Precision: Looks at the proportion of relevant items among the top k results. If you look at the top five results and four of them are relevant, Precision@5 would be 0.8.
Mean Reciprocal Rank (MRR): Looks at the rank at which the first relevant item appears and calculates its reciprocal. For example, if the first relevant item is ranked third, the reciprocal rank is 1/3.
Normalized Discounted Cumulative Gain (NDCG): Gives higher scores to relevant documents in higher-ranked positions and diminishing scores for lower-ranked results. A high NDCG score shows good performance across varying levels of relevance.
You also need to decide which k intervals you want to evaluate for. In the example below, we've chosen k intervals from 1 to 5, but for a larger document search you may want to go with larger intervals up to k=10.
# Define thresholds for the evaluation metrics
at_k_intervals = [1, 2, 3, 4, 5]
# Choose your retrieval evaluation metrics
metrics_kwargs = {
"accuracy_at_k": at_k_intervals ,
"precision_recall_at_k": at_k_intervals ,
"ndcg_at_k": at_k_intervals,
"mrr_at_k": at_k_intervals,
"map_at_k": at_k_intervals,
}
Step 5: Add code for running your retrieval evaluation
# Import the datetime module for timestamping
from datetime import datetime
# Perform a test run using the uploaded scenario set
test_run_item = model_under_test.run_test(
scenario=scenario, # Use the scenario from the scenario set uploaded earlier
name=f"Retrieval Test Run - {datetime.now().strftime('%m-%d %H:%M:%S')}", # Add a timestamp to the test run name
test_run_type=TestRunType.INFORMATION_RETRIEVAL, # Specify that we are running an information retrieval test
calculate_metrics=True,
# Define the evaluation metrics to calculate
metrics_kwargs=metrics_kwargs
)
# Generate a link back to Okareo for evaluation visualization
app_link = test_run_item.app_link
print(f"See results in Okareo: {app_link}")
Step 6. Run your entire Okareo flow script
From your project directory, run your flow script in .okareo/flows
with the okareo run
command.
Step 7. View your RAG evaluation results
You can view the results of your RAG evaluation by clicking the link that gets printed when your flow script is run, or in the "Evaluations" tab in the Okareo app.
For more information on these results and how to interpret them, see Interpret your evaluation results in our RAG evaluation article.
How to integrate your RAG evaluation into CI
To run an Okareo flow script in a CI environment, you'll need to install Okareo on your CI server, add any API keys as CI environment variables, and add the okareo run command to your CI workflow.
Follow this step-by-step example to run your RAG evaluation in GitHub Actions. This can easily be adapted to work with other CI providers.
Start by creating a GitHub repo for your project. Make sure the .okareo folder is directly inside it with all your Okareo files below it.
Next, add your environment variables as GitHub Actions secrets by going to Settings → Secrets and variables → Actions → New repository secret. Add secrets for OKAREO_API_KEY
, OKAREO_PROJECT_ID
and OPENAI_API_KEY
.
Create a GitHub Actions workflow file by going to Actions → Set up a workflow yourself, and add the following code to the newly created main.yml file.
name: RAG evaluation Okareo flow
env:
DEMO_BUILD_ID: ${{ github.run_number }}
OKAREO_API_KEY: ${{ secrets.OKAREO_API_KEY }}
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
OKAREO_PROJECT_ID: ${{ secrets.OKAREO_PROJECT_ID }}
on:
push:
branches: [ "main" ]
pull_request:
branches: [ "main" ]
jobs:
rag-eval:
runs-on: ubuntu-latest
defaults:
run:
working-directory: .
permissions:
contents: 'read'
id-token: 'write'
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Okareo Action
uses: okareo-ai/okareo-action@v2.5
- name: RAG Evaluation
run: |
okareo -v
okareo run -d -f retrieval-evaluation
Now commit changes, which will cause the action to be triggered. The RAG evaluation runs and you can click the link to view the results in the Okareo app. From now on, any push or pull request to the main branch will trigger a new RAG evaluation.
Comparing the results of two CI RAG evaluations
Trigger your CI script to run twice. These two evaluation runs can now be compared in the Okareo app.
Navigate to Score Cards, select your model "vectordb_retrieval_test" and then choose the most recent evaluation run. In the next card, select the same model again, but choose the previous evaluation run. Now both can be compared side by side. You can get an overview of each metric for different values of k by using the slider, as shown below. For more detailed reporting for each evaluation, click on the name of each evaluation run "Retrieval Test Run…".
Continuously improve your RAG through RAG evaluation in CI
Incorporating RAG evaluation into your CI workflow provides ongoing assurance of the reliability of your RAG system, but it also lays the groundwork for continuous improvement. By regularly comparing evaluation runs, you can proactively make data-driven adjustments to the retrieval part of your RAG system, allowing your RAG to grow and show resilience in a production environment.
Okareo provides a framework for evaluating your RAG that can be easily integrated into CI — and it's well-placed to support you if you decide to move towards continuous deployment. Sign up to Okareo today.
If you've already built a RAG system, whether it's for your own enterprise or as part of a customer-facing SaaS product offered to multiple tenants, you need to implement regression testing to ensure that any changes, such as newly added data or new tenants added to the system, don't cause the RAG's responses to degrade. Even if you haven't changed anything, your system might still regress because of model drift.
Integrating RAG evaluation into your CI workflow allows you to progress from a proof-of-concept RAG to a mature and stable product that you can confidently maintain in a production environment.
This article focuses on evaluating the retrieval phase of a RAG system. To understand more about evaluating the LLM part of a RAG, check out our article on LLM evaluation in CI.
The case for RAG evaluation in CI
Adding RAG evaluation to your CI workflow will give you confidence that your RAG is suitable for use in a production environment. Automating a RAG evaluation to be triggered any time a change is made to your RAG means that if any part of your RAG regresses, you'll be able to track which change caused it.
This allows you to run tests against your RAG automatically whenever you've made a change to the RAG system, including:
Changes to the intent detection system, such as new or rephrased intents being added.
Changes to the retrieval system, such as adding or removing documents from a vector DB, fine-tuning or replacing the embedding model, or changing the embeddings themselves or the algorithms used to compare their similarity.
Changes to the generation system, such as fine-tuning or re-training the LLM, or changing the system prompts.
RAG evaluation isn't just a one-and-done thing. You need to constantly re-evaluate any time new data is added or the system has changed.
When evaluating your RAG, some of the main things you might want to check for include:
Was any new data correctly added to the vector DB? You can check if any new data was loaded or indexed correctly.
Has the output of each RAG phase regressed? Or does the retrieval phase continue to return appropriate documents after new data is loaded? Have any errors been introduced?
Is there data leakage? For multi-tenant products, whenever new clients are added, you need to ensure that one customer’s data doesn't appear in another customer’s response.
Is speed and accuracy maintained as data grows? These should remain similar even when large amounts of new data is added. You may need to adjust your k value (the total number of returned embeddings) as your dataset grows in order to achieve this.
Examples of RAG evaluation that would require CI integration
Any RAG systems that are regularly updated (either with new documents being added to the vector DB, or with other changes to the system) would benefit from integrating RAG evaluation into a CI workflow. Two examples follow below.
A SaaS provider offering document search of internal engineering parts and products for different companies
Each tenant of this system has complex, large-scale databases with advanced AI search capabilities. One tenant is an engineering company called Acme Systems, and an employee of Acme Systems is looking for a specific type of sensor. They don't know its name or number but can describe it, so they enter a query into the RAG system: "Find a high-temperature sensor compatible with XYZ system, similar to what we ordered last month."
The RAG system filters out results based on metadata tags or other rules. To start with, only documents that are owned by "Acme Systems" should be returned, and not those of other tenant companies. But it also needs to ensure that only documents classified as "high temperature sensors" are retrieved. Finally the generative part of the system should return a natural language response like “The closest matches for a high-temperature sensor compatible with XYZ are Part #S12345 and Part #S23456. Both meet the specifications for high-temperature tolerance and match previous orders from last month.”
A variety of regression tests should be run whenever changes are made to this type of RAG system.
New products added: When new products are added to the vector DB, the retrieval part of the system should continue to return the most relevant documents, and the generation part should still be able to take a test query like "high-temperature sensor compatible with system XYZ" and return a response that's just as useful as before.
Changes to classification logic or metadata: If new fields are added to each product (for example, “temperature range” or “compatibility"), or new categories are added, the products should still be correctly categorized: for example, all "high temperature sensors" should be returned and no "low temperature sensors."
New tenant added: If a new tenant is added to the SaaS system, it's vital that their queries must not include other customers' results (or vice versa), as this would be confusing, not to mention leaking proprietary data. If you have separate vector databases for each tenant, then this evaluation needs to happen at the intent detection phase. However if you have only one vector DB and filter each tenant using metadata (a much cheaper option), then this becomes part of the retrieval phase.
Vector DB size increases: Over time, thousands of new products may be added, and you need to ensure that the retrieval phase continues to perform at a similar level, including accuracy, precision and recall. If these get worse, it may indicate that you need to change your k-value.
A single company's customer service question answering RAG
This company, which we'll call WebBizz, has an agent that accepts queries from customers on different topics. It has a bunch of internal policies stored as documents in a vector DB that can help inform its answers. For example, one policy might be that it will accept product returns for up to 30 days after an order was placed.
A customer, wanting to return an item, sends a query: "Can I return the blue T-shirt I recently bought?" The RAG has access to a customer database, so it's able to find out the date the customer ordered the blue T-shirt. It's able to detect that the intent is to return an item and find the most relevant policy documents relating to this.
The LLM is then able to use this policy document (along with its knowledge of the date of purchase) to determine whether a return is still permissible, and returns a natural language response such as "Yes, you can return your blue T-shirt. Please email returns@example.com with order reference number XXXXXXXX."
Whenever new policy documents are added, you should run an evaluation on the retrieval part of your RAG system, to ensure that no regressions have occurred. For example, a test query like "Can I return the blue T-shirt I recently bought?" should continue to return documents relating to returns and exchanges, with the most relevant documents being ranked at the top of the list.
How to evaluate the retrieval phase of a RAG using Okareo
This step-by-step example shows how to use Okareo to do RAG evaluation for WebBizz's customer service question answering RAG. Later, we'll cover how to add this to your CI workflow.
This example focuses on evaluating the retrieval phase of the RAG, which consists of an embedding model and a vector DB. We're using ChromaDB as the vector DB, along with its own built-in embedding model. The Okareo flow scripts are written in Python (although TypeScript is also an option).
Step 1: Install Okareo
Start by signing up for an Okareo account and installing the Okareo CLI on your local machine. This includes some environment variables, including API keys for Okareo and OpenAI (we'll be using one of their models), and your Okareo project ID, which can be found in the URL when you're logged into the app.
export OKAREO_API_KEY=<YOUR_OKAREO_API_KEY>
export OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>
export OKAREO_PROJECT_ID=<YOUR_OKAREO_PROJECT_ID>
export OKAREO_PATH="<YOUR_OKAREO_PATH>
Step 2: Load the WebBizz policy documents into the vector DB
In .okareo/flows
, create a file called retrieval-eval.py
, and begin by loading the WebBizz policy documents into ChromaDB. The full code for this example is provided on our GitHub. Create metadata categories for each document, such as "Support" and "Return and exchange."
### Load documents and create corresponding metadata ###
# Import the necessary libraries
import os
from io import StringIO
import pandas as pd
# Load documents from Okareo's GitHub repository
webbizz_articles = os.popen('curl https://raw.githubusercontent.com/okareo-ai/okareo-python-sdk/main/examples/webbizz_10_articles.jsonl').read()
# Convert the JSONL string to a pandas DataFrame
jsonObj = pd.read_json(path_or_buf=StringIO(webbizz_articles), lines=True)
# Create rough categories for each document based on the content
# Store the categories in metadata_list
metadata_list = []
input_list = list(jsonObj.input)
for i in range(len(input_list)):
if "sustainability" in input_list[i] or "security" in input_list[i]:
metadata_list.append({"article_type": "Safety and sustainability"})
elif "support" in input_list[i] or "help" in input_list[i]:
metadata_list.append({"article_type": "Support"})
elif "return" in input_list[i] or "exchange" in input_list[i]:
metadata_list.append({"article_type": "Return and exchange"})
else:
metadata_list.append({"article_type": "Miscellaneous"})
### Create ChromaDB instance and add documents and metadata to it ###
# Import ChromaDB
import chromadb
# Create a ChromaDB client
chroma_client = chromadb.Client()
# Create a ChromaDB collection
# The collection will be used to store the documents as vector embeddings
# We want to measure the similarity between questions and documents using cosine similarity
collection = chroma_client.create_collection(name="retrieval_test", metadata={"hnsw:space": "cosine"})
# Add the documents to the collection with the corresponding metadata (the in-built embedding model converts the documents to vector embeddings). Each document ID comes from the corresponding "result" in the scenario.
collection.add(
documents=list(jsonObj.input),
ids=list(jsonObj.result),
metadatas=metadata_list
)
Step 3: Upload a set of test scenarios to Okareo
A scenario set is a series of test input questions, each paired with a list of relevant results (ordered by the most relevant) that the retrieval system should return. These scenarios need to be written by a human (in this example they were created by us at Okareo).
### Create a scenario set ###
# Import libraries
import tempfile
from okareo import Okareo
from okareo_api_client.models import TestRunType
from okareo.model_under_test import CustomModel, ModelInvocation
# Create an instance of the Okareo client
OKAREO_API_KEY = os.environ.get("OKAREO_API_KEY")
if not OKAREO_API_KEY:
raise ValueError("OKAREO_API_KEY environment variable is not set")
okareo = Okareo(OKAREO_API_KEY)
# Download questions from Okareo's GitHub repository
webbizz_retrieval_questions = os.popen('curl https://raw.githubusercontent.com/okareo-ai/okareo-python-sdk/main/examples/webbizz_retrieval_questions.jsonl').read()
# Save the questions to a temporary file
temp_dir = tempfile.gettempdir()
file_path = os.path.join(temp_dir, "webbizz_retrieval_questions.jsonl")
with open(file_path, "w+") as file:
file.write(webbizz_retrieval_questions)
# Upload the questions to Okareo from the temporary file
scenario = okareo.upload_scenario_set(file_path=file_path, scenario_name="Retrieval Articles Scenario")
# Clean up the temporary file
os.remove(file_path)
Step 4. Register your embedding model and vector DB with Okareo
The following code defines and registers a custom embedding model with Okareo.
When Okareo calls the custom model's invoke endpoint, it retrieves the top five most relevant results from ChromaDB (controlled by n_results=5
, though you can adjust this value based on evaluation results). The query_results_to_score
function formats the results into the JSON structure required by Okareo.
### Create custom embedding model and register it ###
# A function to convert the query results from our ChromaDB collection into a list of dictionaries with the document ID, score, metadata, and label
def query_results_to_score(results):
parsed_ids_with_scores = []
for i in range(0, len(results['distances'][0])):
# Create a score based on cosine similarity
score = (2 - results['distances'][0][i]) / 2
parsed_ids_with_scores.append(
{
"id": results['ids'][0][i],
"score": score,
"metadata": results['metadatas'][0][i],
"label": f"{results['metadatas'][0][i]['article_type']} WebBizz Article w/ ID: {results['ids'][0][i]}"
}
)
return parsed_ids_with_scores
# Define a custom retrieval model that uses the ChromaDB collection to retrieve documents
# The model will return the top five most relevant documents based on the input query
class CustomEmbeddingModel(CustomModel):
def invoke(self, input: str) -> ModelInvocation:
# Query the collection with the input text
results = collection.query(
query_texts=[input],
n_results=5
)
# Return formatted query results and the model response context
return ModelInvocation(model_prediction=query_results_to_score(results), model_output_metadata={'model_data': input})
# Register the model with Okareo
# This will return a model if it already exists or create a new one if it doesn't
model_under_test = okareo.register_model(name="vectordb_retrieval_test", model=CustomEmbeddingModel(name="custom retrieval"))
Step 5. Choose your evaluation metrics
Okareo offers a number of different retrieval metrics for RAG evaluation for you to choose between, including:
Precision: Looks at the proportion of relevant items among the top k results. If you look at the top five results and four of them are relevant, Precision@5 would be 0.8.
Mean Reciprocal Rank (MRR): Looks at the rank at which the first relevant item appears and calculates its reciprocal. For example, if the first relevant item is ranked third, the reciprocal rank is 1/3.
Normalized Discounted Cumulative Gain (NDCG): Gives higher scores to relevant documents in higher-ranked positions and diminishing scores for lower-ranked results. A high NDCG score shows good performance across varying levels of relevance.
You also need to decide which k intervals you want to evaluate for. In the example below, we've chosen k intervals from 1 to 5, but for a larger document search you may want to go with larger intervals up to k=10.
# Define thresholds for the evaluation metrics
at_k_intervals = [1, 2, 3, 4, 5]
# Choose your retrieval evaluation metrics
metrics_kwargs = {
"accuracy_at_k": at_k_intervals ,
"precision_recall_at_k": at_k_intervals ,
"ndcg_at_k": at_k_intervals,
"mrr_at_k": at_k_intervals,
"map_at_k": at_k_intervals,
}
Step 5: Add code for running your retrieval evaluation
# Import the datetime module for timestamping
from datetime import datetime
# Perform a test run using the uploaded scenario set
test_run_item = model_under_test.run_test(
scenario=scenario, # Use the scenario from the scenario set uploaded earlier
name=f"Retrieval Test Run - {datetime.now().strftime('%m-%d %H:%M:%S')}", # Add a timestamp to the test run name
test_run_type=TestRunType.INFORMATION_RETRIEVAL, # Specify that we are running an information retrieval test
calculate_metrics=True,
# Define the evaluation metrics to calculate
metrics_kwargs=metrics_kwargs
)
# Generate a link back to Okareo for evaluation visualization
app_link = test_run_item.app_link
print(f"See results in Okareo: {app_link}")
Step 6. Run your entire Okareo flow script
From your project directory, run your flow script in .okareo/flows
with the okareo run
command.
Step 7. View your RAG evaluation results
You can view the results of your RAG evaluation by clicking the link that gets printed when your flow script is run, or in the "Evaluations" tab in the Okareo app.
For more information on these results and how to interpret them, see Interpret your evaluation results in our RAG evaluation article.
How to integrate your RAG evaluation into CI
To run an Okareo flow script in a CI environment, you'll need to install Okareo on your CI server, add any API keys as CI environment variables, and add the okareo run command to your CI workflow.
Follow this step-by-step example to run your RAG evaluation in GitHub Actions. This can easily be adapted to work with other CI providers.
Start by creating a GitHub repo for your project. Make sure the .okareo folder is directly inside it with all your Okareo files below it.
Next, add your environment variables as GitHub Actions secrets by going to Settings → Secrets and variables → Actions → New repository secret. Add secrets for OKAREO_API_KEY
, OKAREO_PROJECT_ID
and OPENAI_API_KEY
.
Create a GitHub Actions workflow file by going to Actions → Set up a workflow yourself, and add the following code to the newly created main.yml file.
name: RAG evaluation Okareo flow
env:
DEMO_BUILD_ID: ${{ github.run_number }}
OKAREO_API_KEY: ${{ secrets.OKAREO_API_KEY }}
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
OKAREO_PROJECT_ID: ${{ secrets.OKAREO_PROJECT_ID }}
on:
push:
branches: [ "main" ]
pull_request:
branches: [ "main" ]
jobs:
rag-eval:
runs-on: ubuntu-latest
defaults:
run:
working-directory: .
permissions:
contents: 'read'
id-token: 'write'
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Okareo Action
uses: okareo-ai/okareo-action@v2.5
- name: RAG Evaluation
run: |
okareo -v
okareo run -d -f retrieval-evaluation
Now commit changes, which will cause the action to be triggered. The RAG evaluation runs and you can click the link to view the results in the Okareo app. From now on, any push or pull request to the main branch will trigger a new RAG evaluation.
Comparing the results of two CI RAG evaluations
Trigger your CI script to run twice. These two evaluation runs can now be compared in the Okareo app.
Navigate to Score Cards, select your model "vectordb_retrieval_test" and then choose the most recent evaluation run. In the next card, select the same model again, but choose the previous evaluation run. Now both can be compared side by side. You can get an overview of each metric for different values of k by using the slider, as shown below. For more detailed reporting for each evaluation, click on the name of each evaluation run "Retrieval Test Run…".
Continuously improve your RAG through RAG evaluation in CI
Incorporating RAG evaluation into your CI workflow provides ongoing assurance of the reliability of your RAG system, but it also lays the groundwork for continuous improvement. By regularly comparing evaluation runs, you can proactively make data-driven adjustments to the retrieval part of your RAG system, allowing your RAG to grow and show resilience in a production environment.
Okareo provides a framework for evaluating your RAG that can be easily integrated into CI — and it's well-placed to support you if you decide to move towards continuous deployment. Sign up to Okareo today.