The Best Embedding Model For RAG is The One That Best Fits Your Data

RAG

Boris Selitser, Co-founder of Okareo

August 12, 2024

Just like with any development project, when building a RAG, one size does not fit all. The size and type of data you plan to retrieve from your RAG will sway many engineering decisions. One of the first and most important decisions is which embedding model to use.

So how do you decide which embedding model is best for your data?

And what does ‘best’ even mean in the RAG context?

These are the questions we will answer next. Note: you can run all the examples in this blog via these two notebooks:

Even more fun would be to plug in your own RAG data and see which embedding model does better. The primary tools we'll be using for this are Okareo and ChromaDB.

Why Your RAG Model Has to Fit Your Data

While many embedding models can be used in a RAG system, the performance of the resulting system may differ drastically depending on the specific combination of your data and the model you choose. The model that scores best on a benchmark like MTEB (the Massive Text Embedding Benchmark) may not perform best on your data.

The reason, as we described in our recent post about LLM benchmarking (and why you likely need baselining instead), is that benchmarks use extremely broad datasets for their scoring. If you take those benchmark scores at face value and choose a model based on them, there is no guarantee you'll get good performance out of your own system.

Trusting MTEB to choose a model for your RAG is like buying new trousers based on reviews alone. Even if the reviews are stellar, the trousers may simply not fit you.

To underline the point: some models that rank high on the MTEB benchmark may have been overfitted to the specific tasks they are evaluated on, and users report much lower performance on their own data.

A much better approach for choosing a model for your RAG system is baselining using your own data. This way, you evaluate the performance of your system over time with data and user interactions that are as close as possible to your production use case, reducing the risk that you'll need to redesign your system after it has already been implemented.

How To Evaluate a Model for RAG Using Your Own Data

While evaluating a RAG using your own data may sound like a lot of work, it doesn't have to be. In essence, it comes down to three steps:

  1. Generating RAG questions that are similar to your users’ interactions.

  2. Determining the metrics you care about and the process of measuring them.

  3. Running the comparison across models, with your RAG questions, and measuring the metrics that you decided are important.

Let’s go through these steps one by one.

Step 1: Generate Your RAG Questions

One of the most straightforward and popular uses of RAG is chatbot-style question answering. To capture this, we want a meaningful number of questions that represent our typical users and the data they will be accessing. Where do we get this 'meaningful number' of questions? We generate them synthetically. A good starting point is 100+ questions, and the Okareo SDKs (Python and TypeScript) have a quick way of giving us that starting point.

Here is the key bit of Python code that generates the RAG questions:

import random
import string

from okareo import Okareo
# ScenarioSetGenerate, ScenarioType, and GenerationTone are also imported from the Okareo SDK

okareo = Okareo(OKAREO_API_KEY)
random_string = ''.join(random.choices(string.ascii_letters, k=5))
# Use the scenario set of documents to generate a scenario of questions 
generated_scenario = okareo.generate_scenario_set(
  ScenarioSetGenerate(
    name=f"Retrieval - Generated Scenario - {random_string}",
    source_scenario_id=document_scenario.scenario_id,
    number_examples=4, # Number of questions to generate for each document
    generation_type=ScenarioType.TEXT_REVERSE_QUESTION, # This type is for questions from the text
    generation_tone=GenerationTone.INFORMAL, # Specifying tone of the generated questions
    post_template="""{"question": "{generation.input}", "document": "{input}"}""", # pair each generated question with its source document for easy validation
  )
)

# Print a link back to Okareo app to see the generated scenario 
print(f"See generated scenario in Okareo app: {generated_scenario.app_link}")

Note: The full notebook for Generating Your RAG Questions is available here

Step 2: Determine Key Metrics for your RAG System

Now that we have the questions for evaluation, it's time to see whether the RAG vector database returns the relevant documents for each question. As mentioned, the quality of results from the database is principally determined by the embedding model. How do we measure how often, and in what order, the relevant documents are returned across 100+ questions? If you have been playing with vector DBs, you also know about the k value: how many of the top-scoring documents (as ranked by the embedding function) the vector DB should return. This can get a bit hairy, and sometimes even daunting, so let's focus on a few simple metrics that get the job done most of the time (a short sketch of how to compute them follows the list).

  • Accuracy - For each question, do the retrieved documents contain at least one relevant document?

  • Recall - For each question, do the retrieved documents contain all of the relevant documents?

  • MRR (Mean Reciprocal Rank) - How early in the results does the first relevant document appear? The top result scores the highest and the last result the lowest.
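
To make these definitions concrete, here is a minimal sketch of how the three metrics could be computed by hand. This is not Okareo's implementation (the platform computes them for you); it assumes that for each question you have the ranked list of retrieved document IDs and the set of relevant document IDs:

def retrieval_metrics(retrieved_ids, relevant_ids, k=3):
    """retrieved_ids: ranked doc IDs per question; relevant_ids: set of correct doc IDs per question."""
    hits, full_recall, reciprocal_ranks = 0, 0, []
    for ranked, relevant in zip(retrieved_ids, relevant_ids):
        top_k = ranked[:k]
        # Accuracy@k: at least one relevant document appears in the top k results
        if any(doc in relevant for doc in top_k):
            hits += 1
        # Recall@k: all relevant documents appear in the top k results
        if relevant.issubset(set(top_k)):
            full_recall += 1
        # MRR: 1 / rank of the first relevant document (0 if none appears in the top k)
        rank = next((i + 1 for i, doc in enumerate(top_k) if doc in relevant), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    n = len(retrieved_ids)
    return {"accuracy": hits / n, "recall": full_recall / n, "mrr": sum(reciprocal_ranks) / n}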

The metrics above tell us whether the vector DB, together with the embedding model, is giving us relevant results. But just like with any engineering decision, overall performance also needs to account for latency and cost, among other factors, including the cost of maintaining and operating the implementation on your infrastructure. We'll try to bring all these factors together in the next section.

Step 3: Let’s Kick Off This Model Showdown!

Note: The full embedding model comparison is laid out in this notebook: Model Showdown

We'll be using ChromaDB to compare the different embedding models in the rest of this blog. It's quick to set up, and the choice of vector DB has little bearing on which embedding model comes out ahead.
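
For context, setting up a ChromaDB collection with a specific embedding function only takes a few lines. The sketch below is generic (the collection name and documents are placeholders, and the Okareo registration step is omitted); the notebooks wrap this kind of setup in a create_vector_collection helper:

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.Client()  # in-memory client; use chromadb.PersistentClient(path=...) to keep data on disk
ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
collection = client.create_collection(name="rag_documents", embedding_function=ef)

# Embeddings are computed by the collection's embedding function, both on add and on query
collection.add(documents=["Annual plans can be refunded within 30 days of purchase ..."], ids=["doc-1"])
results = collection.query(query_texts=["what's the refund window for annual plans?"], n_results=3)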

We will start with all-MiniLM-L6-v2. It's a lightweight, general-purpose model from Sentence Transformers (free!) and it's also the default embedding model for ChromaDB.

The interesting portion of the code for running the 100+ questions we created earlier against this model is:

embedding_model_name = "all-MiniLM-L6-v2" # This is the default SentenceTransformer model that ChromaDB uses to embed the documents 
default_sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2") 
model_under_test, collection = create_vector_collection(embedding_model_name, default_sentence_transformer_ef) 

# Perform a test run using the uploaded scenario  
test_run_item = model_under_test.run_test(
  scenario=scenario, # use the scenario uploaded earlier in this notebook
  name=f"RAG Comparison {embedding_model_name} - {random_string}",
  test_run_type=TestRunType.INFORMATION_RETRIEVAL, # specify that we are running a retrieval test
)

# Print a link back to Okareo app for evaluation visualization 
print(f"See results in Okareo app for embedding model {embedding_model_name}: {test_run_item.app_link}")

After running the above, here are the results you will see in the Okareo app:

The results of evaluation for the all-MiniLM-L6-v2 model.

The leftmost card tells us that the chance of the first result being relevant is about 66%, which is not great. If we consider the top 3 results (middle card), Accuracy and Recall improve to ~85%; that's better but still not great. They improve further to 93% if we consider the top 5 results (right card). Taking the top 5 results would make sense if we expected several relevant documents, say 2-3 per question. But in this evaluation we are expecting an exact answer from a single document. In that case, sending a context of 5 documents, 4 of which are not relevant, is going to confuse the LLM trying to generate an answer to the user's question. Research and practical implementations have shown repeatedly that LLMs get confused by large amounts of irrelevant context, irrespective of the maximum context window.

Would one of the embedding models from OpenAI do better? Let's try text-embedding-3-large, OpenAI's most powerful embedding model at the time of writing.

The results of evaluation for the text-embedding-3-large model.

This is a major improvement overall. If we consider the difference from top 3 to top 5 results, Accuracy and Recall don't improve much, which means most of the relevant documents are returned in the top 3 or not at all. Passing a context of 3 documents (selecting k = 3) gives the LLM a much tighter context window in the answer generation step.

At this point, it's worth mentioning latency and cost. all-MiniLM-L6-v2 runs locally and consumes only about 90 MB of memory. You still need to dedicate compute resources to it and size the infrastructure in proportion to how much data you plan to ingest into the vector DB/store and your retrieval throughput. It's something you have to operate and maintain yourself, but it has a relatively small footprint. text-embedding-3-large from OpenAI sits behind an API, which makes it worry-free from the maintenance side. Being a much larger model and sitting behind an API also brings cost (in tokens) and latency implications: every document or question embedding requires an API round trip. For reference, on my average Mac machine, embedding 30 documents and 100+ questions took about twice as long with text-embedding-3-large (~52 sec) as with all-MiniLM-L6-v2 (~25 sec).
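
If you want to reproduce this kind of timing on your own data, a rough comparison only needs a timer around each embedding call. Here is a sketch; OPENAI_API_KEY, documents, and questions are assumed to exist from the earlier steps, and the numbers will of course vary with hardware and network:

import time
from chromadb.utils import embedding_functions

texts = documents + questions  # the same texts you plan to ingest and query with

embedders = {
    "all-MiniLM-L6-v2": embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2"),
    "text-embedding-3-large": embedding_functions.OpenAIEmbeddingFunction(api_key=OPENAI_API_KEY, model_name="text-embedding-3-large"),
}

for name, ef in embedders.items():
    start = time.perf_counter()
    ef(texts)  # ChromaDB embedding functions are callables that return one vector per input text
    print(f"{name}: {time.perf_counter() - start:.1f}s for {len(texts)} texts")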

Is there a compromise? We will try the gte-small model from Alibaba DAMO Academy. It's fairly compact (about 120 MB of memory) and something you can run locally.
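
Swapping in gte-small only means changing the embedding function handed to the collection; everything else in the evaluation stays the same. A minimal sketch, assuming the publicly released Hugging Face model id thenlper/gte-small:

from chromadb.utils import embedding_functions

# "thenlper/gte-small" is our assumption for the Hugging Face model id of gte-small
gte_small_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="thenlper/gte-small")
model_under_test, collection = create_vector_collection("gte-small", gte_small_ef)  # same notebook helper as before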

The results of evaluation for the gte-small model.

Performance when considering the top 3 results is close to that of text-embedding-3-large, and it is actually better in Accuracy and Recall when looking at the top 5 results. MRR is slightly worse, as text-embedding-3-large tends to return relevant results closer to the top. Despite being a much smaller model overall, gte-small is on par with all-MiniLM-L6-v2 in both vector size (384 embedding dimensions) and latency. The smaller vector size (384 vs. 3072 for text-embedding-3-large) also means it will be faster at ingesting new documents and executing queries against the vector DB.

Here is a summary of all three models with k = 3:

All three models that we evaluated side by side.

Summary: The Best Embedding Model for RAG Is…

As we mentioned in the introduction, there is no single best model for every RAG, so unfortunately we can't give you a one-size-fits-all answer here.

What we can say, though, is that this article describes the key decision criteria you may want to use to determine the best embedding model for your RAG use case.

We also walked through three models you may want to evaluate, and there are many more to consider; the MTEB list is a good starting point. The key thing to remember is that, whichever model you consider, you always want to evaluate it with your own data.

Ready to run your own custom evaluation? Sign up for Okareo if you haven’t already, and then follow the steps in this article or in the Compare Embedding Models documentation page.
