How to Do RAG Evaluation with Okareo: A Step-by-Step Guide

RAG

Matt Wyman, CEO/Co-Founder

Sarah Barber, Senior Technical Content Writer

November 15, 2024

Retrieval Augmented Generation (RAG) has surged in popularity in the last year, and it is now the foundational architecture for building robust LLM applications. RAG is the fundamental approach for integrating external data into your LLM workflows: just as it's hard to imagine a web application without a database, it's now becoming hard to imagine an LLM without RAG.

RAG combines a retrieval system with an LLM (and sometimes other components, such as a classification model) to facilitate dynamic, data-driven responses from the LLM, and it is fast becoming indispensable for most modern use cases.

In this guide, we show how Okareo, a RAG evaluation tool, can be used to evaluate each component of a RAG architecture.

What is RAG and when should you use it?

RAG is an architecture that initially consisted of two components: a retrieval system (which is responsible for finding relevant information from an external data source based on a user query) and an LLM. 

Almost all LLM-powered apps incorporate some form of RAG these days, as RAG allows your LLM to access up-to-date or contextually relevant sources of information. Without RAG, it's hard to create an AI app that adds value.

Today, many RAG systems include multiple retrieval systems, meaning they also need a classifier to work out the user's intent behind their query and route it to the most relevant retrieval system. For example, in a customer service RAG system, a user query about technical documentation might be routed to a retrieval system with a knowledge base as the data source, whereas a general customer service question might be routed to a retrieval system with an FAQ database as the data source.

Many retrieval systems store their data as vectors in an external vector database. When a user queries the RAG system, the query is passed to the vector database, which uses a similarity search algorithm to return the documents that are closest to the original user query. These documents are later fed into your LLM.

It's also common to use a classifier to help filter your vector database results. Vector databases can store metadata alongside each document, such as category or date, which allows for easy filtering of results when the vector DB is queried, making your RAG more efficient.
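
For example, with ChromaDB (the vector database used later in this guide), a metadata filter can be passed directly in the query call. The snippet below is only a minimal sketch with made-up documents, assuming an article_type field in the metadata:

# Minimal sketch: metadata filtering in a ChromaDB query (illustrative documents only)
import chromadb
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="filter_demo", metadata={"hnsw:space": "cosine"})
# Add two small documents with an article_type metadata field
collection.add(
    documents=["How to reset your password", "Our latest sustainability report"],
    ids=["doc_support_1", "doc_sustainability_1"],
    metadatas=[{"article_type": "Support"}, {"article_type": "Safety and sustainability"}]
)
# Query, restricting results to documents whose metadata matches the filter
results = collection.query(
    query_texts=["I can't log in to my account"],
    n_results=1,
    where={"article_type": "Support"}  # metadata filter applied at query time
)
print(results["ids"])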

For more information on what RAG is, see our RAG architecture article.

What is RAG evaluation?

RAG evaluation assesses the performance of a RAG system, which consists of multiple components that run in a specified order. Each component needs to be evaluated separately, as issues in earlier stages can have downstream effects on later components.

The components of a RAG system that need evaluating include:

  • Classifier: Evaluated using metrics like precision, recall, and F1 score to determine how well it's routing queries to the right retrieval systems (or other components). A quick example of computing these metrics follows this list.

  • Retrieval system: Includes multiple subcomponents such as a vector database, embedding model, rerank model, and result filtering. Each of these components can be evaluated as a whole or separately.

  • LLM: Assessed for accuracy, relevance, and coherence of the generated response. May include standard metrics like BLEU score or other types of scores, such as friendliness, that can be evaluated either by a human or another LLM acting as a judge.
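
As a quick illustration of the classifier metrics mentioned above (this uses scikit-learn rather than Okareo, and the labels are made up), precision, recall, and F1 can be computed by comparing the routing labels a classifier predicts against the labels a subject matter expert expects:

# Minimal sketch: precision, recall, and F1 for an intent classifier (illustrative labels only)
from sklearn.metrics import classification_report
# Routing labels expected by a subject matter expert
expected = ["technical_docs", "faq", "faq", "technical_docs", "faq"]
# Routing labels predicted by the classifier under test
predicted = ["technical_docs", "faq", "technical_docs", "technical_docs", "faq"]
# Per-class precision, recall, and F1 score
print(classification_report(expected, predicted))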

How to evaluate your RAG model using Okareo

Okareo provides tools to evaluate each component of a RAG system, including classification models, the entire retrieval pipeline, and LLMs. You can create scenarios and run evaluations using the Okareo app or programmatically using Okareo's Python or TypeScript SDKs. The latter is the recommended approach once you want to chain together evaluations of multiple parts of your RAG and run the same evaluations on a regular basis.

In this guide, we'll focus on evaluating the retrieval system, as it's typically the most complex part of a RAG system to evaluate. Okareo has different metrics for evaluating each part of the retrieval system (vector database, embedding model, rerank model, result filtering, etc.), helping you evaluate things like precision, ranking accuracy, and relevance.

If you want to see how to evaluate other parts of a RAG system, we offer guides on evaluating a classifier using Okareo (which is the simplest component to evaluate) and on evaluating LLMs.

RAG Evaluation: The retrieval system

Most RAG retrieval systems consist of an embedding model and a vector database as the core architecture, but the most optimized and efficient RAG systems usually also include a rerank model and result filtering.

  • Embedding model: An embedding model converts input text (queries, documents, etc.) into vector embeddings, a type of vector that captures semantic meaning. The embedding is then sent as a query to the vector database.

  • Vector database: Stores external data that has already been converted into vector embeddings. When queried by the embedding model, the vector DB does a similarity search to find documents with semantic similarity to the vector embedding query (and by extension, the original text) and returns a shortlist of the IDs of the most similar documents, plus the corresponding text.

  • Reranker: This extra layer is often required to refine the ranking of the documents returned by the vector DB. The documents get re-ranked according to a deeper, more context-aware analysis of their relevance to the original query, which generally improves the quality of your results. The reranker will return the top "k" most relevant results, where k is a value you specify. Evaluating your retrieval system as you build your RAG will help you determine the best value of k. More details will be given on this later.

  • Result filtering: This refines your results even more by applying metadata-based filters, such as date, category, or document type to your query. For example, you might choose to filter your results by topic to help match the user's intent, or you might filter out any documents older than a certain date (like in a news retrieval system where recent results are more important).

Evaluating each of these components helps ensure that the retrieval part of your RAG system is not holding back your LLM further down the pipeline. The sketch below shows how these pieces typically fit together.
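
This is only a minimal sketch of the retrieval flow, assuming a ChromaDB collection that has already been populated with documents and article_type metadata (as in the tutorial below); the reranker is left as a placeholder comment here, and one way to add a real one is sketched near the end of this article:

# Minimal sketch of the retrieval flow: embed + vector search, optional filter, top-k (illustrative only)
import chromadb

chroma_client = chromadb.Client()
# Assumes a collection already populated with documents and article_type metadata
collection = chroma_client.get_or_create_collection(name="retrieval_demo", metadata={"hnsw:space": "cosine"})

def retrieve(query: str, k: int = 3, metadata_filter: dict = None):
    # Embedding model + vector DB: ChromaDB's built-in embedding model encodes the query,
    # then a cosine-similarity search returns a shortlist of candidate documents
    results = collection.query(
        query_texts=[query],
        n_results=10,
        where=metadata_filter  # optional result filtering via metadata
    )
    ids, docs = results["ids"][0], results["documents"][0]
    # A reranker would re-score the candidates here before the final cut;
    # without one, we simply keep the top k results by vector similarity
    return list(zip(ids, docs))[:k]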

Retrieval evaluation: Step-by-step tutorial

Step 1 (pre-evaluation): Fill your vector database with data

For this tutorial, we provide full example data so you can follow along, and the full code example is available on GitHub.

The example assumes you've downloaded the Okareo CLI and have followed the instructions to export environment variables and to initialize an Okareo project, meaning the code examples below belong in a file like <your_flow_script>.py.

The example consists of a customer service question-answering RAG system for a sample company called WebBizz. The code below reads some WebBizz knowledge base articles, adds categories to them as metadata, and saves the documents and metadata in ChromaDB, a free, open-source vector DB that's easy to use and doesn't require an account. ChromaDB has a default embedding model built into it, which gets used when you add data to it or query it.

### Load documents and create corresponding metadata to be later added to vector DB ###
# Import the necessary libraries
import os
from io import StringIO  
import pandas as pd
# Load documents from Okareo's GitHub repository
webbizz_articles = os.popen('curl https://raw.githubusercontent.com/okareo-ai/okareo-python-sdk/main/examples/webbizz_10_articles.jsonl').read()
# Convert the JSONL string to a pandas DataFrame
jsonObj = pd.read_json(path_or_buf=StringIO(webbizz_articles), lines=True)
# Create rough categories for each document based on the content
# Store the categories in metadata_list
metadata_list = []
input_list = list(jsonObj.input)
for i in range(len(input_list)):
    if "sustainability" in input_list[i] or "security" in list(input_list[i]):
        metadata_list.append({"article_type": "Safety and sustainability"})
    elif "support" in input_list[i] or "help" in list(input_list[i]):
        metadata_list.append({"article_type": "Support"})
    elif "return" in input_list[i]:
        metadata_list.append({"article_type": "Return and exchange"})
    else:
        metadata_list.append({"article_type": "Miscellaneous"})
### Create ChromaDB instance and add documents and metadata to it ###
# Import ChromaDB
import chromadb
# Create a ChromaDB client
chroma_client = chromadb.Client()
# Create a ChromaDB collection
# The collection will be used to store the documents as vector embeddings
# We want to measure the similarity between questions and documents using cosine similarity
collection = chroma_client.create_collection(name="retrieval_test", metadata={"hnsw:space": "cosine"})
# Add the documents to the collection with the corresponding metadata (the in-built embedding model converts the documents to vector embeddings)
collection.add(
    documents=list(jsonObj.input),
    ids=list(jsonObj.result),
    metadatas=metadata_list
)

Step 2: Create a scenario set

A scenario set is a set of example input queries to the retrieval system along with their expected results, which in this case are the IDs of the most relevant documents in the vector DB.

We provide an example scenario set for the WebBizz question-answering RAG system. This includes inputs (example questions) and the corresponding expected results (which are determined by a subject matter expert).

### Create a scenario set ###
# Import libraries
import tempfile
from okareo import Okareo
from okareo_api_client.models import TestRunType
from okareo.model_under_test import CustomModel, ModelInvocation
# Create an instance of the Okareo client
OKAREO_API_KEY = os.environ.get("OKAREO_API_KEY")
if not OKAREO_API_KEY:
    raise ValueError("OKAREO_API_KEY environment variable is not set")
okareo = Okareo(OKAREO_API_KEY)
# Download questions from Okareo's GitHub repository
webbizz_retrieval_questions = os.popen('curl https://raw.githubusercontent.com/okareo-ai/okareo-python-sdk/main/examples/webbizz_retrieval_questions.jsonl').read()
# Save the questions to a temporary file
temp_dir = tempfile.gettempdir()
file_path = os.path.join(temp_dir, "webbizz_retrieval_questions.jsonl")
with open(file_path, "w+") as file:
    file.write(webbizz_retrieval_questions)
# Upload the questions to Okareo from the temporary file
scenario = okareo.upload_scenario_set(file_path=file_path, scenario_name="Retrieval Articles Scenario")
# Clean up the temporary file
os.remove(file_path)
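
If you want to evaluate against your own questions rather than the example file, you can build the JSONL file yourself before uploading it in the same way. The sketch below assumes the same structure as the example data (a question as the input and the IDs of the relevant documents as the result); the questions and document IDs are placeholders, and the okareo client, os, and tempfile come from the code above:

# Minimal sketch: building your own scenario file (placeholder questions and document IDs)
import json
custom_scenarios = [
    {"input": "How do I return a product I bought from WebBizz?", "result": ["<relevant-document-id>"]},
    {"input": "Does WebBizz offer 24/7 customer support?", "result": ["<relevant-document-id>"]},
]
custom_path = os.path.join(tempfile.gettempdir(), "custom_retrieval_questions.jsonl")
with open(custom_path, "w") as file:
    for row in custom_scenarios:
        file.write(json.dumps(row) + "\n")
# Upload the custom scenario set to Okareo, just like the example one
custom_scenario = okareo.upload_scenario_set(file_path=custom_path, scenario_name="Custom Retrieval Scenario")
os.remove(custom_path)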

Step 3: Register the embedding model and vector DB with Okareo

Okareo supports any embedding model or vector DB. Some have direct built-in support, such as the Pinecone and QDrant vector DBs and the Cohere embedding model, but any can be used through our CustomModel class.

In this example, we're using ChromaDB and its built-in embedding model, creating a new class with Okareo's CustomModel class as a base. The code below shows a custom embedding model being defined and registered with Okareo.

Later, when Okareo calls the custom model's invoke method, it will return the top 5 most relevant results from ChromaDB (specified with n_results=5; you may choose to change this number based on your evaluation results). The query_results_to_score function converts the results into the JSON format that Okareo expects.

### Create custom embedding model and register it ###
# A function to convert the query results from our ChromaDB collection into a list of dictionaries with the document ID, score, metadata, and label
def query_results_to_score(results):
    parsed_ids_with_scores = []
    for i in range(0, len(results['distances'][0])):
        # Convert the cosine distance (range 0 to 2) into a similarity score (range 0 to 1)
        score = (2 - results['distances'][0][i]) / 2
        parsed_ids_with_scores.append(
            {
                "id": results['ids'][0][i],
                "score": score,
                "metadata": results['metadatas'][0][i],
                "label": f"{results['metadatas'][0][i]['article_type']} WebBizz Article w/ ID: {results['ids'][0][i]}"
            }
        )
    return parsed_ids_with_scores
# Define a custom retrieval model that uses the ChromaDB collection to retrieve documents
# The model will return the top 5 most relevant documents based on the input query
class CustomEmbeddingModel(CustomModel):
    def invoke(self, input: str) -> ModelInvocation:
        # Query the collection with the input text
        results = collection.query(
            query_texts=[input],
            n_results=5
        )
        # Return formatted query results and the model response context
        return ModelInvocation(model_prediction=query_results_to_score(results), model_output_metadata={'model_data': input})
# Register the model with Okareo
# This will return a model if it already exists or create a new one if it doesn't
model_under_test = okareo.register_model(name="vectordb_retrieval_test", model=CustomEmbeddingModel(name="custom retrieval"))

Step 4: Decide which metrics and other criteria are most important for your evaluation

Okareo offers a number of industry-standard metrics for evaluating RAG retrieval systems, including accuracy, precision, recall, NDCG, MRR, and MAP. You can specify which you want to use with the metrics_kwargs parameter (see the code in the next section).

As well as the metrics specified above, performance and efficiency of retrieval are also important. The reranker returns the top "k" most relevant results, with k being a value you need to decide upon. Too high a value of k will slow down the retrieval process, but if k is too low, you might miss out on relevant results. Working out the correct balance between speed and relevance can be tricky, but Okareo makes it easier by allowing you to evaluate against all the above metrics for different values of k so you can choose a value that gives a good balance. Specifying at_k_intervals = [1, 2, 3, 4, 5] and then defining each metric like "accuracy_at_k": at_k_intervals will record metrics for each value of k, and you'll be able to see how each metric performed for each value of k on Okareo's retrieval evaluation dashboard.

Screenshot of part of Okareo's RAG evaluation report dashboard

Visualization on Okareo's retrieval evaluation dashboard viewing each metric at different values of k.

As our example is a precise question-answering RAG, we've chosen k intervals up to a value of 5. For question answering, the answer is typically found within 1 or 2 documents, so setting k = 5 provides a reasonable margin of tolerance without introducing unnecessary overhead. But if you're doing document search, which tends to require retrieval of larger amounts of information, you might want to increase your value of k to 10.

Step 5: Run an Okareo evaluation on your model

The run_test method is what actually runs the evaluation on your model. You pass in your scenario, specify that the type of evaluation is a retrieval test, and pass in the different metrics and intervals of k that you want to use for your evaluation.

Put all the code snippets together in your Okareo flow file and run it with okareo run -f <your_flow_script>. When the evaluation has finished running, a link will take you to your evaluation results dashboard so you can drill into your results, see how the evaluation went, and decide on a value of k.

### Evaluating the custom embedding model ###
# Import the datetime module for timestamping
from datetime import datetime
# Define the values of k at which each metric will be calculated
at_k_intervals = [1, 2, 3, 4, 5]
# Perform a test run using the uploaded scenario set
test_run_item = model_under_test.run_test(
    scenario=scenario, # use the scenario from the scenario set uploaded earlier
    name=f"Retrieval Test Run - {datetime.now().strftime('%m-%d %H:%M:%S')}", # add a timestamp to the test run name
    test_run_type=TestRunType.INFORMATION_RETRIEVAL, # specify that we are running an information retrieval test
    calculate_metrics=True,
    # Define the evaluation metrics to calculate
    metrics_kwargs={
        "accuracy_at_k": at_k_intervals ,
        "precision_recall_at_k": at_k_intervals ,
        "ndcg_at_k": at_k_intervals,
        "mrr_at_k": at_k_intervals,
        "map_at_k": at_k_intervals,
    }
)
# Generate a link back to Okareo for evaluation visualization
model_results = test_run_item.model_metrics.to_dict()
app_link = test_run_item.app_link
print(f"See results in Okareo: {app_link}")

Step 6: Interpret your evaluation results

The results dashboard begins with a metrics overview graph, which shows how each metric scored on average for each value of k. This is shown with <metric_name>@K.

Screenshot of the metrics overview graph in Okareo's RAG evaluation results dashboard

Hovering over a specific value of k (represented on the x axis), you can see the average score for each metric across all your scenarios.

When interpreting these results, you need to consider what is an acceptable level of precision for your specific RAG system. This may depend on whether its purpose is to answer a specific question or to do document search, or on how many documents are in your vector database. The WebBizz example set only has ten documents, so there is often only going to be one relevant result per query. This means that as k increases, the average precision will necessarily drop: with a single relevant document, precision@k can be at most 1/k, so precision@5 tops out at 0.2 even when retrieval is perfect.

In this example, we can see that all the metrics except average precision are higher at k=2 than at k=1, but there is not such a big difference between k=2 and k=3. Hence, the best value of k is probably 2. However, if average precision needs to be higher, this may suggest you need to improve the reranking model to ensure that the most relevant result is always ranked first.

Scrolling down the page shows some row metrics. You can see the accuracy, precision, recall, MRR, NDCG, and MAP (when k=1) for each individual input query from your scenario set. Each row is one scenario from the scenario set above. Using the "Metrics @ k" slider allows you to see the different metric values for different values of k.

Screen capture of row metrics in Okareo's RAG evaluation results dashboard

When looking at row metrics, you can filter by one particular metric to discover which scenarios failed and why. Let's look at an individual accuracy metric to understand this in more depth:

Accuracy just indicates whether or not you got a relevant result in your response, so the value will always be 1.00 or 0.00. Filter by accuracy and use the numeric slider to see only the rows with less than 100% accuracy. Using the Metrics@k slider, you can see that for k=1 there were 3 inaccuracies, for k=2 there were 2, and for k=3 or higher there were none.

To find out why there were inaccuracies, you can drill down further by clicking on the expanding arrows for each row to get more details. Doing this for the first inaccurate row, you'll see that the most relevant article (the one with the same ID as the expected result) is the third in the list. If k=1 or 2, this result won't get returned, so this will be marked as an inaccuracy. 

Screen capture of further details of the accuracy metrics in Okareo's RAG evaluation results dashboard

These more detailed metrics allow you to see why something failed and to gain insight into which part of the system might be at fault. In the screen capture above, you can see the top 5 results returned by the retrieval system, and the expected result is only third in the list. The document IDs are listed, which allows you to check the original data and determine what the problem is. Some examples of this include:

  • Your original scenario is wrong and there is actually a more relevant document that you missed

  • Multiple documents may be relevant, but you only selected one

  • The reranker is not doing a good enough job of ranking the most relevant articles at the top (or you're not using a reranker and should be!). One way to add a reranker is sketched after this list.
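
If the reranker turns out to be the weak link (or is missing entirely), one common approach is to add a cross-encoder that re-scores the candidates returned by the vector DB before keeping the top k. The sketch below shows one way this could be bolted onto the custom model from Step 3; the sentence-transformers package and the model name are assumptions, not part of the tutorial code:

# Minimal sketch: adding a cross-encoder reranker to the retrieval step (illustrative only)
import math
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # an assumed, commonly used open reranking model

class RerankedEmbeddingModel(CustomModel):
    def invoke(self, input: str) -> ModelInvocation:
        # Over-fetch candidates from ChromaDB, then let the cross-encoder decide the final order
        results = collection.query(query_texts=[input], n_results=10)
        ids = results['ids'][0]
        docs = results['documents'][0]
        metadatas = results['metadatas'][0]
        # Score each (query, document) pair with the cross-encoder
        raw_scores = reranker.predict([(input, doc) for doc in docs])
        reranked = sorted(zip(ids, metadatas, raw_scores), key=lambda item: item[2], reverse=True)
        # Keep the top 5 and format them the way Okareo expects,
        # squashing the raw cross-encoder scores into the 0-1 range with a sigmoid
        parsed = [
            {
                "id": id_,
                "score": 1 / (1 + math.exp(-float(score))),
                "metadata": metadata,
                "label": f"{metadata['article_type']} WebBizz Article w/ ID: {id_}"
            }
            for id_, metadata, score in reranked[:5]
        ]
        return ModelInvocation(model_prediction=parsed, model_output_metadata={'model_data': input})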

You can also drill down even further and view metadata. Earlier, you saved each category as metadata in your code, and now you can check which categories of article are being returned for your query. The top result is a support article, which seems correct for this particular query.

Screenshot of metadata for each of the accuracy metrics in Okareo's RAG evaluation results dashboard

For comprehensive RAG evaluation, use Okareo

RAG evaluation is essential for optimizing the performance and accuracy of your RAG system. Each part of the RAG system needs to be evaluated in turn, and it's important to fully understand and do a good job of evaluating your retrieval system, as your downstream LLM evaluation is only as good as the evaluation of your previous components.

If you're using a classifier for intent detection, you should use Okareo to conduct in-depth RAG evaluations by first evaluating your classifier, then evaluating the retrieval system, and finally evaluating the LLM you're using. You can get started with Okareo by signing up today.

Retrieval Augmented Generation (RAG) has surged in popularity in the last year, and it is now the foundational architecture for building robust LLM applications. RAG is the fundamental approach for integrating external data into your LLM workflows: just as it's hard to imagine a web application without a database, it's now becoming hard to imagine an LLM without RAG.

RAG combines a retrieval system with an LLM (and sometimes other components such as a classification model) to facilitate dynamic, data-driven responses from the LLM, which is fast becoming indispensable for most modern-use cases.

In this guide, we show how Okareo, a RAG evaluation tool, can be used to evaluate each component of a RAG architecture.

What is RAG and when should you use it?

RAG is an architecture that initially consisted of two components: a retrieval system (which is responsible for finding relevant information from an external data source based on a user query) and an LLM. 

Almost all LLM-powered apps will incorporate a RAG these days, as RAG allows your LLM to access up-to-date or contextually relevant sources of information. Without RAG, it's hard to create an AI app that adds value.

Today, many RAG systems include multiple retrieval systems, meaning they also need a classifier to work out the user's intent behind their query and route it to the most relevant retrieval system. For example, in a customer service RAG system, a user query about technical documentation might be routed to a retrieval system with a knowledge base as the data source, whereas a general customer service question might be routed to a retrieval system with an FAQ database as the data source.

Many retrieval systems store their data as vectors in an external vector database. When a user queries the RAG system, the query is passed to the vector database, which uses a similarity search algorithm to return the documents that are closest to the original user query. These documents are later fed into your LLM.

It's also common to use a classifier to help filter your vector database results. Vector databases can store metadata alongside each document, such as category or date, which allows for easy filtering of results when the vector DB is queried, making your RAG more efficient.

For more information on what RAG is, see our RAG architecture article.

What is RAG evaluation?

RAG evaluation assesses the performance of a RAG system, which consists of multiple components that run in a specified order. Each component needs to be evaluated separately, as issues in earlier stages can have downstream effects on later components.

The components of a RAG system that need evaluating include:

  • Classifier: Evaluated using metrics like precision, recall, and F1 score to determine how well it's routing queries to the right retrieval systems (or other components).

  • Retrieval system: Includes multiple subcomponents such as a vector database, embedding model, rerank model, and result filtering. Each of these components can be evaluated as a whole or separately.

  • LLM: Assessed for accuracy, relevance, and coherence of the generated response. May include standard metrics like BLEU score or other types of scores, such as friendliness, that can be evaluated either by a human or another LLM acting as a judge.

How to evaluate your RAG model using Okareo

Okareo provides tools to evaluate each component of a RAG system, including classification models, the entire retrieval pipeline, and LLMs. You can create scenarios and run evaluations using the Okareo app or programmatically using Okareo's Python or TypeScript SDKs. This latter method is the recommended way to do it once you want to start chaining together the evaluation of multiple parts of your RAG and running the same evaluations on a regular basis.

In this guide, we’ll focus on evaluating the retrieval system, as it’s typically the most complex part of a RAG system to evaluate. Okareo has different metrics for evaluating each part of the retrieval system (vector database, embedding model, rerank model, result filtering etc.) helping you evaluate different things like precision, ranking accuracy, and relevance.

If you want to see how to evaluate other parts of a RAG system, we offer guides on evaluating a classifier using Okareo (which is the simplest component to evaluate) and on evaluating LLMs.

RAG Evaluation: The retrieval system

Most RAG retrieval systems consist of an embedding model and a vector database as the core architecture, but the most optimized and efficient RAG systems usually also include a rerank model and result filtering.

  • Embedding model: An embedding model converts input text (queries, documents, etc.) into vector embeddings, a type of vector that captures semantic meanings.The embedding is then sent as a query to the vector database.

  • Vector database: Stores external data that has already been converted into vector embeddings. When queried by the embedding model, the vector DB does a similarity search to find documents with semantic similarity to the vector embedding query (and by extension, the original text) and returns a shortlist of the IDs of the most similar documents, plus the corresponding text.

  • Reranker: This extra layer is often required to refine the ranking of the documents returned by the vector DB. The documents get re-ranked according to a deeper, more context-aware analysis of their relevance to the original query, which generally improves the quality of your results. The reranker will return the top "k" most relevant results, where k is a value you specify. Evaluating your retrieval system as you build your RAG will help you determine the best value of k. More details will be given on this later.

  • Result filtering: This refines your results even more by applying metadata-based filters, such as date, category, or document type to your query. For example, you might choose to filter your results by topic to help match the user's intent, or you might filter out any documents older than a certain date (like in a news retrieval system where recent results are more important).

Evaluating each of these components helps ensure that the retrieval part of your RAG system is not holding back your LLM farther down the pipeline.

Retrieval evaluation: Step-by-step tutorial

Step 1 (pre-evaluation): Fill your vector database with data

For this tutorial, we provide full example data so you can follow along, and the full code example is available on GitHub.

The example assumes you've downloaded the Okareo CLI and have followed the instructions to export environment variables and to initialize an Okareo project, meaning the code examples below belong in a file like <your_flow_script>.py.

The example consists of a customer service question-answering RAG system for a sample company called WebBizz. The code below reads some WebBizz knowledge base articles, adds categories to them as metadata, and saves the documents and metadata in ChromaDB, a free, open-source vector DB that's easy to use and doesn't require an account. ChromaDB has a default embedding model built into it, which gets used when you add data to it or query it.

### Load documents and create corresponding metadata to be later added to vector DB ###
# Import the necessary libraries
import os
from io import StringIO  
import pandas as pd
# Load documents from Okareo's GitHub repository
webbizz_articles = os.popen('curl https://raw.githubusercontent.com/okareo-ai/okareo-python-sdk/main/examples/webbizz_10_articles.jsonl').read()
# Convert the JSONL string to a pandas DataFrame
jsonObj = pd.read_json(path_or_buf=StringIO(webbizz_articles), lines=True)
# Create rough categories for each document based on the content
# Store the categories in metadata_list
metadata_list = []
input_list = list(jsonObj.input)
for i in range(len(input_list)):
    if "sustainability" in input_list[i] or "security" in list(input_list[i]):
        metadata_list.append({"article_type": "Safety and sustainability"})
    elif "support" in input_list[i] or "help" in list(input_list[i]):
        metadata_list.append({"article_type": "Support"})
    elif "return" in input_list[i]:
        metadata_list.append({"article_type": "Return and exchange"})
    else:
        metadata_list.append({"article_type": "Miscellaneous"})
### Create ChromaDB instance and add documents and metadata to it ###
# Import ChromaDB
import chromadb
# Create a ChromaDB client
chroma_client = chromadb.Client()
# Create a ChromaDB collection
# The collection will be used to store the documents as vector embeddings
# We want to measure the similarity between questions and documents using cosine similarity
collection = chroma_client.create_collection(name="retrieval_test", metadata={"hnsw:space": "cosine"})
# Add the documents to the collection with the corresponding metadata (the in-built embedding model converts the documents to vector embeddings)
collection.add(
    documents=list(jsonObj.input),
    ids=list(jsonObj.result),
    metadatas=metadata_list
)

Step 2: Create a scenario set

A scenario set is a set of example input queries to the retrieval system along with their expected results, which in this case is a list of IDs of the most relevant documents in the vector DB. 

We provide an example scenario set for the WebBizz question-answering RAG system. This includes inputs (example question) and the corresponding expected results (which are determined by a subject matter expert).

### Create a scenario set ###
# Import libraries
import tempfile
from okareo import Okareo
from okareo_api_client.models import TestRunType
from okareo.model_under_test import CustomModel, ModelInvocation
# Create an instance of the Okareo client
OKAREO_API_KEY = os.environ.get("OKAREO_API_KEY")
if not OKAREO_API_KEY:
    raise ValueError("OKAREO_API_KEY environment variable is not set")
okareo = Okareo(OKAREO_API_KEY)
# Download questions from Okareo's GitHub repository
webbizz_retrieval_questions = os.popen('curl https://raw.githubusercontent.com/okareo-ai/okareo-python-sdk/main/examples/webbizz_retrieval_questions.jsonl').read()
# Save the questions to a temporary file
temp_dir = tempfile.gettempdir()
file_path = os.path.join(temp_dir, "webbizz_retrieval_questions.jsonl")
with open(file_path, "w+") as file:
    file.write(webbizz_retrieval_questions)
# Upload the questions to Okareo from the temporary file
scenario = okareo.upload_scenario_set(file_path=file_path, scenario_name="Retrieval Articles Scenario")
# Clean up the temporary file
os.remove(file_path)

Step 3: Register the embedding model and vector DB with Okareo

Okareo supports any embedding model or vector DB. Some of these have direct built-in support such as Pinecone and QDrant vector DBs, and the Cohere embedding model, but any can be supported through our CustomModel class. 

In this example we're using ChromaDB and its built-in embedding model by creating a new class by using Okareo's CustomModel class as a base. The code below shows a custom embedding model being defined and registered with Okareo. 

Later, when Okareo calls the custom model's invoke endpoint, this will return the top 5 most relevant results from ChromaDB (this is specified with n_results=5, but you may choose to change this number according to your evaluation results). The query_results_to_score function converts the results into the JSON format that Okareo expects.

### Create custom embedding model and register it ###
# A function to convert the query results from our ChromaDB collection into a list of dictionaries with the document ID, score, metadata, and label
def query_results_to_score(results):
    parsed_ids_with_scores = []
    for i in range(0, len(results['distances'][0])):
        # Create a score based on cosine similarity
        score = (2 - results['distances'][0][i]) / 2
        parsed_ids_with_scores.append(
            {
                "id": results['ids'][0][i],
                "score": score,
                "metadata": results['metadatas'][0][i],
                "label": f"{results['metadatas'][0][i]['article_type']} WebBizz Article w/ ID: {results['ids'][0][i]}"
            }
        )
    return parsed_ids_with_scores
# Define a custom retrieval model that uses the ChromaDB collection to retrieve documents
# The model will return the top 5 most relevant documents based on the input query
class CustomEmbeddingModel(CustomModel):
    def invoke(self, input: str) -> ModelInvocation:
        # Query the collection with the input text
        results = collection.query(
            query_texts=[input],
            n_results=5
        )
        # Return formatted query results and the model response context
        return ModelInvocation(model_prediction=query_results_to_score(results), model_output_metadata={'model_data': input})
# Register the model with Okareo
# This will return a model if it already exists or create a new one if it doesn't
model_under_test = okareo.register_model(name="vectordb_retrieval_test", model=CustomEmbeddingModel(name="custom retrieval"))

Step 4: Decide which metrics and other criteria are most important for your evaluation

Okareo offers a number of industry standard metrics for evaluating RAG retrieval systems, including accuracy, precision, recall, NDCG, MRR and MAP. You can specify which you want to use with the metrics_kwargs parameter (see code in the next section).

As well as the metrics specified above, performance and efficiency of retrieval are also important. The reranker returns the top "k" most relevant results, with k being a value you need to decide upon. Too high a value of k will slow down the retrieval process, but if k is too low, you might miss out on relevant results. Working out the correct balance between speed and relevance can be tricky, but Okareo makes it easier by allowing you to evaluate against all the above metrics for different values of k so you can choose a value that gives a good balance. Specifying at_k_intervals = [1, 2, 3, 4, 5] and then defining each metric like "accuracy_at_k": at_k_intervals will record metrics for each value of k, and you'll be able to see how each metric performed for each value of k on Okareo's retrieval evaluation dashboard.

Screenshot of part of Okareo's RAG evaluation report dashboard

Visualization on Okareo's retrieval evaluation dashboard viewing each metric at different values of k.

As our example is a precise question-answering RAG, we've chosen k intervals up to a value of 5. For question answering, the answer is typically found within 1 or 2 documents, so setting k = 5 provides a reasonable margin of tolerance without introducing unnecessary overhead. But if you're doing document search, which tends to require retrieval of larger amounts of information, you might want to increase your value of k to 10.

Step 5: Run an Okareo evaluation on your model

The run_test method is the method that actually runs the evaluation on your model. You pass in your scenario, specify that the type of evaluation is a retrieval test, and pass in the different metrics and intervals of k that you want to use for your evaluation. 

Put all the code snippets together in your Okareo flow file and run it (with okareo run -f <your_flow_script>) When the evaluation has finished running, a link will take you to your evaluation results dashboard so you can drill into your results, see how the evaluation went, and decide on a value of k.

### Evaluating the custom embedding model ###
# Import the datetime module for timestamping
from datetime import datetime
# Define thresholds for the evaluation metrics
at_k_intervals = [1, 2, 3, 4, 5] 
# Perform a test run using the uploaded scenario set
test_run_item = model_under_test.run_test(
    scenario=scenario, # use the scenario from the scenario set uploaded earlier
    name=f"Retrieval Test Run - {datetime.now().strftime('%m-%d %H:%M:%S')}", # add a timestamp to the test run name
    test_run_type=TestRunType.INFORMATION_RETRIEVAL, # specify that we are running an information retrieval test
    calculate_metrics=True,
    # Define the evaluation metrics to calculate
    metrics_kwargs={
        "accuracy_at_k": at_k_intervals ,
        "precision_recall_at_k": at_k_intervals ,
        "ndcg_at_k": at_k_intervals,
        "mrr_at_k": at_k_intervals,
        "map_at_k": at_k_intervals,
    }
)
# Generate a link back to Okareo for evaluation visualization
model_results = test_run_item.model_metrics.to_dict()
app_link = test_run_item.app_link
print(f"See results in Okareo: {app_link}")

Step 6: Interpret your evaluation results

The results dashboard begins with a metrics overview graph, which shows how each metric scored on average for each value of k. This is shown with <metric_name>@K.

Screenshot of the metrics overview graph in Okareo's RAG evaluation results dashboard

Hovering over a specific value of k (represented on the x axis), you can see the average score for each metric across all your scenarios.

When interpreting these results, you need to consider what is an acceptable level of precision for your specific RAG system. This may depend on whether its purpose is to answer a specific question or to do document search, or on how many documents are in your vector database. The WebBizz example set only has ten documents, so there is often only going to be one relevant result. This means that when k is increased, the average precision will necessarily drop.

In this example, we can see that all the metrics except average precision are higher than when k=1, but there is not such a big difference between k=2 and k=3. Hence, the best value of k is probably 2. However, if average precision needs to be higher, then this may suggest you need to make improvements to the re-ranking model to ensure that the most relevant result is always returned.

Scrolling down the page shows some row metrics. You can see the accuracy, precision, recall, MRR, NDCG, and MAP (when k=1) for each individual input query from your scenario set. Each row is one scenario from the scenario set above. Using the "Metrics @ k" slider allows you to see the different metric values for different values of k.

Screen capture of row metrics in Okareo's RAG evaluation results dashboard

When looking at row metrics, you can filter by one particular metric to discover which scenarios failed and why. Let's look at an individual accuracy metric to understand this in more depth:

Accuracy just indicates whether or not you got a relevant result in your response, so the value will always be 1.00 or 0.00. Filtering by accuracy, use the numeric slider to see only the rows with less than 100% accuracy. Using the Metrics@k slider, you can see that for k=1 there were 3 inaccuracies, for k=2 there were 2 and for k=3 or higher there were no inaccuracies.

To find out why there were inaccuracies, you can drill down further by clicking on the expanding arrows for each row to get more details. Doing this for the first inaccurate row, you'll see that the most relevant article (the one with the same ID as the expected result) is the third in the list. If k=1 or 2, this result won't get returned, so this will be marked as an inaccuracy. 

Screen capture of further details of the accuracy metrics in Okareo's RAG evaluation results dashboard

These more detailed metrics allow you to see why something failed and to gain insight into which part of the system might be at fault. In the screen capture above, you can see the top 5 results returned by the retrieval system, and the expected result is only third in the list. The document IDs are listed, which allows you to check the original data and determine what the problem is. Some examples of this include:

  • Your original scenario is wrong and there is actually a more relevant document that you missed

  • Multiple documents may be relevant, but you only selected one

  • The reranker is not doing a good enough job at ranking the most relevant articles at the top (or you're not using a reranker and should be!)

You can also drill down even further and view metadata. Earlier, you saved each category as metadata in your code, and now you can check and see what categories of article are being returned for your query.  The top result is a support article, which seems correct for the particular query.

Screenshot of metadata for each of the accuracy metrics in Okareo's RAG evaluation results dashboard

For comprehensive RAG evaluation, use Okareo

RAG evaluation is essential for optimizing the performance and accuracy of your RAG system. Each part of the RAG system needs to be evaluated in turn, and it's important to fully understand and do a good job of evaluating your retrieval system, as your downstream LLM evaluation is only as good as the evaluation of your previous components.

If you're using a classifier for intent detection, you should use Okareo to conduct in-depth RAG evaluations by first evaluating your classifier, then evaluating the retrieval system, and finally evaluate the LLM you're using. You can get started with Okareo by signing up today.

Retrieval Augmented Generation (RAG) has surged in popularity in the last year, and it is now the foundational architecture for building robust LLM applications. RAG is the fundamental approach for integrating external data into your LLM workflows: just as it's hard to imagine a web application without a database, it's now becoming hard to imagine an LLM without RAG.

RAG combines a retrieval system with an LLM (and sometimes other components such as a classification model) to facilitate dynamic, data-driven responses from the LLM, which is fast becoming indispensable for most modern-use cases.

In this guide, we show how Okareo, a RAG evaluation tool, can be used to evaluate each component of a RAG architecture.

What is RAG and when should you use it?

RAG is an architecture that initially consisted of two components: a retrieval system (which is responsible for finding relevant information from an external data source based on a user query) and an LLM. 

Almost all LLM-powered apps will incorporate a RAG these days, as RAG allows your LLM to access up-to-date or contextually relevant sources of information. Without RAG, it's hard to create an AI app that adds value.

Today, many RAG systems include multiple retrieval systems, meaning they also need a classifier to work out the user's intent behind their query and route it to the most relevant retrieval system. For example, in a customer service RAG system, a user query about technical documentation might be routed to a retrieval system with a knowledge base as the data source, whereas a general customer service question might be routed to a retrieval system with an FAQ database as the data source.

Many retrieval systems store their data as vectors in an external vector database. When a user queries the RAG system, the query is passed to the vector database, which uses a similarity search algorithm to return the documents that are closest to the original user query. These documents are later fed into your LLM.

It's also common to use a classifier to help filter your vector database results. Vector databases can store metadata alongside each document, such as category or date, which allows for easy filtering of results when the vector DB is queried, making your RAG more efficient.

For more information on what RAG is, see our RAG architecture article.

What is RAG evaluation?

RAG evaluation assesses the performance of a RAG system, which consists of multiple components that run in a specified order. Each component needs to be evaluated separately, as issues in earlier stages can have downstream effects on later components.

The components of a RAG system that need evaluating include:

  • Classifier: Evaluated using metrics like precision, recall, and F1 score to determine how well it's routing queries to the right retrieval systems (or other components).

  • Retrieval system: Includes multiple subcomponents such as a vector database, embedding model, rerank model, and result filtering. Each of these components can be evaluated as a whole or separately.

  • LLM: Assessed for accuracy, relevance, and coherence of the generated response. May include standard metrics like BLEU score or other types of scores, such as friendliness, that can be evaluated either by a human or another LLM acting as a judge.

How to evaluate your RAG model using Okareo

Okareo provides tools to evaluate each component of a RAG system, including classification models, the entire retrieval pipeline, and LLMs. You can create scenarios and run evaluations using the Okareo app or programmatically using Okareo's Python or TypeScript SDKs. This latter method is the recommended way to do it once you want to start chaining together the evaluation of multiple parts of your RAG and running the same evaluations on a regular basis.

In this guide, we’ll focus on evaluating the retrieval system, as it’s typically the most complex part of a RAG system to evaluate. Okareo has different metrics for evaluating each part of the retrieval system (vector database, embedding model, rerank model, result filtering etc.) helping you evaluate different things like precision, ranking accuracy, and relevance.

If you want to see how to evaluate other parts of a RAG system, we offer guides on evaluating a classifier using Okareo (which is the simplest component to evaluate) and on evaluating LLMs.

RAG Evaluation: The retrieval system

Most RAG retrieval systems consist of an embedding model and a vector database as the core architecture, but the most optimized and efficient RAG systems usually also include a rerank model and result filtering.

  • Embedding model: An embedding model converts input text (queries, documents, etc.) into vector embeddings, a type of vector that captures semantic meanings.The embedding is then sent as a query to the vector database.

  • Vector database: Stores external data that has already been converted into vector embeddings. When queried by the embedding model, the vector DB does a similarity search to find documents with semantic similarity to the vector embedding query (and by extension, the original text) and returns a shortlist of the IDs of the most similar documents, plus the corresponding text.

  • Reranker: This extra layer is often required to refine the ranking of the documents returned by the vector DB. The documents get re-ranked according to a deeper, more context-aware analysis of their relevance to the original query, which generally improves the quality of your results. The reranker will return the top "k" most relevant results, where k is a value you specify. Evaluating your retrieval system as you build your RAG will help you determine the best value of k. More details will be given on this later.

  • Result filtering: This refines your results even more by applying metadata-based filters, such as date, category, or document type to your query. For example, you might choose to filter your results by topic to help match the user's intent, or you might filter out any documents older than a certain date (like in a news retrieval system where recent results are more important).

Evaluating each of these components helps ensure that the retrieval part of your RAG system is not holding back your LLM farther down the pipeline.

Retrieval evaluation: Step-by-step tutorial

Step 1 (pre-evaluation): Fill your vector database with data

For this tutorial, we provide full example data so you can follow along, and the full code example is available on GitHub.

The example assumes you've downloaded the Okareo CLI and have followed the instructions to export environment variables and to initialize an Okareo project, meaning the code examples below belong in a file like <your_flow_script>.py.

The example consists of a customer service question-answering RAG system for a sample company called WebBizz. The code below reads some WebBizz knowledge base articles, adds categories to them as metadata, and saves the documents and metadata in ChromaDB, a free, open-source vector DB that's easy to use and doesn't require an account. ChromaDB has a default embedding model built into it, which gets used when you add data to it or query it.

### Load documents and create corresponding metadata to be later added to vector DB ###
# Import the necessary libraries
import os
from io import StringIO  
import pandas as pd
# Load documents from Okareo's GitHub repository
webbizz_articles = os.popen('curl https://raw.githubusercontent.com/okareo-ai/okareo-python-sdk/main/examples/webbizz_10_articles.jsonl').read()
# Convert the JSONL string to a pandas DataFrame
jsonObj = pd.read_json(path_or_buf=StringIO(webbizz_articles), lines=True)
# Create rough categories for each document based on the content
# Store the categories in metadata_list
metadata_list = []
input_list = list(jsonObj.input)
for i in range(len(input_list)):
    if "sustainability" in input_list[i] or "security" in list(input_list[i]):
        metadata_list.append({"article_type": "Safety and sustainability"})
    elif "support" in input_list[i] or "help" in list(input_list[i]):
        metadata_list.append({"article_type": "Support"})
    elif "return" in input_list[i]:
        metadata_list.append({"article_type": "Return and exchange"})
    else:
        metadata_list.append({"article_type": "Miscellaneous"})
### Create ChromaDB instance and add documents and metadata to it ###
# Import ChromaDB
import chromadb
# Create a ChromaDB client
chroma_client = chromadb.Client()
# Create a ChromaDB collection
# The collection will be used to store the documents as vector embeddings
# We want to measure the similarity between questions and documents using cosine similarity
collection = chroma_client.create_collection(name="retrieval_test", metadata={"hnsw:space": "cosine"})
# Add the documents to the collection with the corresponding metadata (the in-built embedding model converts the documents to vector embeddings)
collection.add(
    documents=list(jsonObj.input),
    ids=list(jsonObj.result),
    metadatas=metadata_list
)

Step 2: Create a scenario set

A scenario set is a set of example input queries to the retrieval system along with their expected results, which in this case is a list of IDs of the most relevant documents in the vector DB. 

We provide an example scenario set for the WebBizz question-answering RAG system. This includes inputs (example question) and the corresponding expected results (which are determined by a subject matter expert).

### Create a scenario set ###
# Import libraries
import tempfile
from okareo import Okareo
from okareo_api_client.models import TestRunType
from okareo.model_under_test import CustomModel, ModelInvocation
# Create an instance of the Okareo client
OKAREO_API_KEY = os.environ.get("OKAREO_API_KEY")
if not OKAREO_API_KEY:
    raise ValueError("OKAREO_API_KEY environment variable is not set")
okareo = Okareo(OKAREO_API_KEY)
# Download questions from Okareo's GitHub repository
webbizz_retrieval_questions = os.popen('curl https://raw.githubusercontent.com/okareo-ai/okareo-python-sdk/main/examples/webbizz_retrieval_questions.jsonl').read()
# Save the questions to a temporary file
temp_dir = tempfile.gettempdir()
file_path = os.path.join(temp_dir, "webbizz_retrieval_questions.jsonl")
with open(file_path, "w+") as file:
    file.write(webbizz_retrieval_questions)
# Upload the questions to Okareo from the temporary file
scenario = okareo.upload_scenario_set(file_path=file_path, scenario_name="Retrieval Articles Scenario")
# Clean up the temporary file
os.remove(file_path)

Step 3: Register the embedding model and vector DB with Okareo

Okareo supports any embedding model or vector DB. Some of these have direct built-in support such as Pinecone and QDrant vector DBs, and the Cohere embedding model, but any can be supported through our CustomModel class. 

In this example we're using ChromaDB and its built-in embedding model by creating a new class by using Okareo's CustomModel class as a base. The code below shows a custom embedding model being defined and registered with Okareo. 

Later, when Okareo calls the custom model's invoke endpoint, this will return the top 5 most relevant results from ChromaDB (this is specified with n_results=5, but you may choose to change this number according to your evaluation results). The query_results_to_score function converts the results into the JSON format that Okareo expects.

### Create custom embedding model and register it ###
# A function to convert the query results from our ChromaDB collection into a list of dictionaries with the document ID, score, metadata, and label
def query_results_to_score(results):
    parsed_ids_with_scores = []
    for i in range(0, len(results['distances'][0])):
        # Create a score based on cosine similarity
        score = (2 - results['distances'][0][i]) / 2
        parsed_ids_with_scores.append(
            {
                "id": results['ids'][0][i],
                "score": score,
                "metadata": results['metadatas'][0][i],
                "label": f"{results['metadatas'][0][i]['article_type']} WebBizz Article w/ ID: {results['ids'][0][i]}"
            }
        )
    return parsed_ids_with_scores
# Define a custom retrieval model that uses the ChromaDB collection to retrieve documents
# The model will return the top 5 most relevant documents based on the input query
class CustomEmbeddingModel(CustomModel):
    def invoke(self, input: str) -> ModelInvocation:
        # Query the collection with the input text
        results = collection.query(
            query_texts=[input],
            n_results=5
        )
        # Return formatted query results and the model response context
        return ModelInvocation(model_prediction=query_results_to_score(results), model_output_metadata={'model_data': input})
# Register the model with Okareo
# This will return a model if it already exists or create a new one if it doesn't
model_under_test = okareo.register_model(name="vectordb_retrieval_test", model=CustomEmbeddingModel(name="custom retrieval"))

Step 4: Decide which metrics and other criteria are most important for your evaluation

Okareo offers a number of industry standard metrics for evaluating RAG retrieval systems, including accuracy, precision, recall, NDCG, MRR and MAP. You can specify which you want to use with the metrics_kwargs parameter (see code in the next section).

As well as the metrics specified above, performance and efficiency of retrieval are also important. The reranker returns the top "k" most relevant results, with k being a value you need to decide upon. Too high a value of k will slow down the retrieval process, but if k is too low, you might miss out on relevant results. Working out the correct balance between speed and relevance can be tricky, but Okareo makes it easier by allowing you to evaluate against all the above metrics for different values of k so you can choose a value that gives a good balance. Specifying at_k_intervals = [1, 2, 3, 4, 5] and then defining each metric like "accuracy_at_k": at_k_intervals will record metrics for each value of k, and you'll be able to see how each metric performed for each value of k on Okareo's retrieval evaluation dashboard.

Screenshot of part of Okareo's RAG evaluation report dashboard

Visualization on Okareo's retrieval evaluation dashboard viewing each metric at different values of k.

As our example is a precise question-answering RAG, we've chosen k intervals up to a value of 5. For question answering, the answer is typically found within 1 or 2 documents, so setting k = 5 provides a reasonable margin of tolerance without introducing unnecessary overhead. But if you're doing document search, which tends to require retrieval of larger amounts of information, you might want to increase your value of k to 10.
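
In code, this choice is simply the list of k values you evaluate at. The tutorial uses the values shown in the next step; the wider range below is purely illustrative of what a document-search use case might look like.

# k values used in this question-answering tutorial (see Step 5)
at_k_intervals = [1, 2, 3, 4, 5]
# A wider, illustrative range you might use for document search instead
# at_k_intervals = list(range(1, 11))  # k = 1..10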

Step 5: Run an Okareo evaluation on your model

The run_test method is what actually runs the evaluation on your model. You pass in your scenario, specify that the type of evaluation is a retrieval test, and pass in the metrics and intervals of k that you want to use for your evaluation.

Put all the code snippets together in your Okareo flow file and run it (with okareo run -f <your_flow_script>). When the evaluation has finished running, a link will take you to your evaluation results dashboard, where you can drill into your results, see how the evaluation went, and decide on a value of k.

### Evaluating the custom embedding model ###
# Import the datetime module for timestamping
from datetime import datetime
# Define the values of k at which each metric will be calculated
at_k_intervals = [1, 2, 3, 4, 5]
# Perform a test run using the uploaded scenario set
test_run_item = model_under_test.run_test(
    scenario=scenario, # use the scenario from the scenario set uploaded earlier
    name=f"Retrieval Test Run - {datetime.now().strftime('%m-%d %H:%M:%S')}", # add a timestamp to the test run name
    test_run_type=TestRunType.INFORMATION_RETRIEVAL, # specify that we are running an information retrieval test
    calculate_metrics=True,
    # Define the evaluation metrics to calculate
    metrics_kwargs={
        "accuracy_at_k": at_k_intervals,
        "precision_recall_at_k": at_k_intervals,
        "ndcg_at_k": at_k_intervals,
        "mrr_at_k": at_k_intervals,
        "map_at_k": at_k_intervals,
    }
)
# Retrieve the calculated metrics and a link back to Okareo for visualization
model_results = test_run_item.model_metrics.to_dict()
app_link = test_run_item.app_link
print(f"See results in Okareo: {app_link}")
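
The code above already captures the metrics in model_results. If you want to skim the numbers without opening the dashboard (for example, in a CI log), a minimal sketch is to print that dictionary; the exact key names and nesting depend on your Okareo SDK version, so inspect the output before building anything on top of it.

### Optional: print the raw metrics for quick inspection (illustrative) ###
# Dump the metrics dictionary returned by the test run; useful in CI logs
# Key names and nesting depend on the Okareo SDK version
import json
print(json.dumps(model_results, indent=2, default=str))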

Step 6: Interpret your evaluation results

The results dashboard begins with a metrics overview graph, which shows how each metric scored on average for each value of k. This is shown with <metric_name>@K.

Screenshot of the metrics overview graph in Okareo's RAG evaluation results dashboard

Hovering over a specific value of k (represented on the x axis), you can see the average score for each metric across all your scenarios.

When interpreting these results, you need to consider what an acceptable level of precision is for your specific RAG system. This may depend on whether its purpose is to answer a specific question or to do document search, or on how many documents are in your vector database. The WebBizz example set only has ten documents, so there is often only going to be one relevant result per query. This means that average precision will necessarily drop as k increases: with a single relevant document, precision@5 can be at most 1/5 = 0.2, even when retrieval is perfect.

In this example, we can see that all the metrics except average precision are higher at k=2 than at k=1, but there is little further improvement between k=2 and k=3. Hence, the best value of k is probably 2. However, if average precision needs to be higher, this may suggest you need to improve the reranking model to ensure that the most relevant result is always ranked first.

Scrolling down the page shows some row metrics. You can see the accuracy, precision, recall, MRR, NDCG, and MAP (when k=1) for each individual input query from your scenario set. Each row is one scenario from the scenario set above. Using the "Metrics @ k" slider allows you to see the different metric values for different values of k.

Screen capture of row metrics in Okareo's RAG evaluation results dashboard

When looking at row metrics, you can filter by one particular metric to discover which scenarios failed and why. Let's look at an individual accuracy metric to understand this in more depth:

Accuracy simply indicates whether or not a relevant result appeared in the response, so the value will always be 1.00 or 0.00. Filter by accuracy and use the numeric slider to see only the rows with less than 100% accuracy. Using the Metrics@k slider, you can see that for k=1 there were 3 inaccuracies, for k=2 there were 2, and for k=3 or higher there were none.

To find out why there were inaccuracies, you can drill down further by clicking the expanding arrows on each row. Doing this for the first inaccurate row, you'll see that the most relevant article (the one with the same ID as the expected result) is third in the list. If k=1 or 2, this result won't be returned, so the row is marked as inaccurate.

Screen capture of further details of the accuracy metrics in Okareo's RAG evaluation results dashboard

These more detailed metrics allow you to see why something failed and to gain insight into which part of the system might be at fault. In the screen capture above, you can see the top 5 results returned by the retrieval system, and the expected result is only third in the list. The document IDs are listed, which allows you to check the original data and work out what the problem is. Possible causes include:

  • Your original scenario is wrong and there is actually a more relevant document that you missed

  • Multiple documents may be relevant, but you only selected one

  • The reranker is not doing a good enough job of ranking the most relevant articles at the top (or you're not using a reranker and should be! See the sketch after this list.)
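
That last point is worth a concrete illustration. The tutorial's custom model returns ChromaDB's ranking as-is, but you could slot a reranker between the vector DB and the results you return from invoke. The sketch below is one possible approach and is not part of the tutorial code: it assumes you've installed the sentence-transformers package and uses one of its publicly available cross-encoder checkpoints.

### Optional: rerank ChromaDB results with a cross-encoder (illustrative sketch) ###
# Assumes `pip install sentence-transformers`; the checkpoint name is an example choice
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, results, top_k: int = 5):
    docs = results["documents"][0]
    ids = results["ids"][0]
    # Score each (query, document) pair with the cross-encoder
    scores = reranker.predict([(query, doc) for doc in docs])
    # Sort document IDs by descending relevance score and keep the top k
    ranked = sorted(zip(ids, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

You could then build the prediction list you return to Okareo from the reranked IDs and scores instead of the raw cosine distances.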

You can also drill down even further and view metadata. Earlier, you saved each category as metadata in your code, and now you can check which categories of article are being returned for your query. The top result is a support article, which seems correct for this particular query.

Screenshot of metadata for each of the accuracy metrics in Okareo's RAG evaluation results dashboard
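
If the metadata reveals that results are coming from the wrong category, one option is to apply result filtering at query time. The snippet below is an illustrative sketch using ChromaDB's where parameter with the article_type metadata you saved earlier; the query text is a made-up example.

### Optional: filter retrieval results by metadata category (illustrative) ###
# Restrict the similarity search to documents tagged as Support articles
filtered_results = collection.query(
    query_texts=["How do I contact customer support?"],
    n_results=5,
    where={"article_type": "Support"}
)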

For comprehensive RAG evaluation, use Okareo

RAG evaluation is essential for optimizing the performance and accuracy of your RAG system. Each part of the RAG system needs to be evaluated in turn, and it's especially important to evaluate your retrieval system thoroughly, as your downstream LLM evaluation is only as good as the evaluation of the components that come before it.

If you're using a classifier for intent detection, you should use Okareo to conduct in-depth RAG evaluations by first evaluating your classifier, then evaluating the retrieval system, and finally evaluating the LLM you're using. You can get started with Okareo by signing up today.

