The Best Embedding Model For RAG is The One That Best Fits Your Data
RAG
Boris Selitser
,
Co-founder of Okareo
August 12, 2024
Just like with any development project, when building a RAG, one size does not fit all. The size and type of data you plan to retrieve from your RAG will sway many engineering decisions. One of the first and key decisions about a RAG is what embedding model you want to use.
So how do you decide which embedding model is best for your data? And what does “best” even mean in the RAG context? These questions are what we will answer next.
Note: You can run all the examples in this blog via these two notebooks:
Even more fun would be to plug in your own RAG data and see which embedding model does better. The primary tools we'll be using for this are Okareo and ChromaDB.
Why your RAG model has to fit your data
While many embedding models can be used in a RAG system, the performance of the resulting system can differ drastically depending on the combination of your data and the model you choose. The best embedding model according to a benchmark like MTEB may not perform best for your data.
The reason for this, as we described in our recent post about LLM benchmarking (and why you likely need baselining instead), is that benchmarks use extremely broad datasets for their scoring. If you take those benchmarks at face value and make a decision on the model to use based on them, there’s no guarantee that you’ll get good performance out of your system.
Trusting MTEB for model choices for RAG is like choosing new trousers based on reviews. Even if the reviews are stellar, the trousers may just not fit you at all.
Similarly, some models that rank high on the MTEB benchmark may have been overfitted for the specific tasks that they are being evaluated on, with users reporting much lower performance with their own data.
A much better approach for choosing a model for your RAG system is baselining with your own data. This way, you evaluate the performance of your system over time with data and user interactions that are as close as possible to your production use case, reducing the risk that you’ll need to redesign your system after you’ve already implemented it.
How to evaluate a model for RAG using your own data
Using your own data to evaluate a RAG may sound like a lot of work, but it doesn’t have to be. This task essentially comes down to three steps:
Generating RAG questions that are similar to your users’ interactions
Determining the metrics you care about and the process of measuring them
Running the comparison across models with your RAG questions and measuring the metrics that you decided are important
Let’s go through these steps one by one.
Step 1: Generate your RAG questions
One of the most straightforward and popular uses of RAG is answering questions with some sort of chatbot. To capture this, we want a meaningful number of questions that represent our typical users and the data they will be accessing. Where do we get this “meaningful number” of questions? We synthetically generate them. A good starting point is 100+ questions, and the Okareo SDKs (Python and TypeScript) have a quick way of giving us this starting point.
Here is the key bit of Python code that generates the RAG questions:
import random
import string

from okareo import Okareo
# ScenarioSetGenerate, ScenarioType, and GenerationTone come from the Okareo SDK,
# and document_scenario is created in an earlier cell of the notebook.

okareo = Okareo(OKAREO_API_KEY)
random_string = ''.join(random.choices(string.ascii_letters, k=5))

# Use the scenario set of documents to generate a scenario of questions
generated_scenario = okareo.generate_scenario_set(
    ScenarioSetGenerate(
        name=f"Retrieval - Generated Scenario - {random_string}",
        source_scenario_id=document_scenario.scenario_id,
        number_examples=4,  # Number of questions to generate for each document
        generation_type=ScenarioType.TEXT_REVERSE_QUESTION,  # Generate questions from the text
        generation_tone=GenerationTone.INFORMAL,  # Tone of the generated questions
        # For easy validation, each generated question is stored next to its source document
        post_template="""{"question": "{generation.input}", "document": "{input}"}""",
    )
)

# Print a link back to the Okareo app to see the generated scenario
print(f"See generated scenario in Okareo app: {generated_scenario.app_link}")
Note: The full notebook for generating your RAG questions is available here
Step 2: Determine key metrics for your RAG system
Now that we have the questions for evaluation, it’s time to see if the RAG vector database returns the relevant documents for each question. As mentioned, the quality of results from the database is principally determined by the embedding model. How do we measure how often, and in what order, the relevant documents are returned across 100+ questions? If you have been playing with vector DBs, you also know about the k value: it determines how many of the top-scoring documents (scored via the embedding function) the vector DB returns. This can get a bit hairy, and sometimes even daunting. Let’s focus on a few simple metrics that get the job done most of the time.
Accuracy - Do retrieved documents contain at least one relevant document for each question?
Recall - Do retrieved documents contain all the relevant documents for each question?
MRR (Mean Reciprocal Rank) - How early in the results does the first relevant document appear (with the top result scoring the highest and the last result scoring the lowest)?
The metrics above tell us whether the vector DB and the embedding model together are giving us relevant results. But just like with any engineering decision, we also need to weigh latency and cost, among other factors, including the cost of maintaining and operating the implementation on your own infrastructure. We’ll bring all these factors together in the next section.
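To make these metrics concrete, here is a minimal sketch (not the Okareo implementation) of how Accuracy, Recall, and MRR at a given k could be computed. The retrieved and relevant inputs are hypothetical: the ranked document IDs returned by the vector DB and the IDs of the truly relevant documents for each question. In ChromaDB, k corresponds to the n_results argument of collection.query().

from typing import Dict, List, Set

def retrieval_metrics(
    retrieved: Dict[str, List[str]],  # question id -> ranked doc ids returned by the vector DB
    relevant: Dict[str, Set[str]],    # question id -> doc ids that are actually relevant
    k: int = 3,
) -> Dict[str, float]:
    accuracy_hits, recall_sum, rr_sum = 0, 0.0, 0.0
    for qid, docs in retrieved.items():
        top_k = docs[:k]
        rel = relevant[qid]
        # Accuracy: at least one relevant doc appears in the top k
        accuracy_hits += int(any(d in rel for d in top_k))
        # Recall: fraction of all relevant docs that appear in the top k
        recall_sum += len(rel.intersection(top_k)) / len(rel)
        # Reciprocal rank: 1 / rank of the first relevant doc (0 if none in the top k)
        rr_sum += next((1.0 / (i + 1) for i, d in enumerate(top_k) if d in rel), 0.0)
    n = len(retrieved)
    return {"accuracy": accuracy_hits / n, "recall": recall_sum / n, "mrr": rr_sum / n}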
Step 3: Compare the embedding models
Note: The full embedding model comparison is laid out in this notebook.
We’ll be using ChromaDB to compare the different embedding models in the rest of this post. It’s quick to set up, and the choice of vector DB has little bearing on which embedding model comes out ahead.
We will start with all-MiniLM-L6-v2. It's a lightweight, general-purpose model from Sentence Transformers (free!), and it’s also set as the default embedding model for ChromaDB.
The interesting portion of the code for running the 100+ questions we created earlier against this model is:
from chromadb.utils import embedding_functions
# TestRunType comes from the Okareo SDK; create_vector_collection, scenario, and
# random_string are defined in earlier cells of the notebook.

# all-MiniLM-L6-v2 is the default SentenceTransformer model that ChromaDB uses to embed documents
embedding_model_name = "all-MiniLM-L6-v2"
default_sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
model_under_test, collection = create_vector_collection(embedding_model_name, default_sentence_transformer_ef)

# Perform a test run using the uploaded scenario
test_run_item = model_under_test.run_test(
    scenario=scenario,  # Use the scenario uploaded earlier in this notebook
    name=f"RAG Comparison {embedding_model_name} - {random_string}",
    test_run_type=TestRunType.INFORMATION_RETRIEVAL,  # Specify that we are running a retrieval test
)

# Print a link back to the Okareo app for evaluation visualization
print(f"See results in Okareo app for embedding model {embedding_model_name}: {test_run_item.app_link}")
After running the code above, here are the results you will see in the Okareo app:
The results of the evaluation for the all-MiniLM-L6-v2 model.
The leftmost card tells us that the chances of the first result being relevant are about 66%, which is not great. If we consider the top 3 results (middle card), Accuracy and Recall improve to ~85%, which is a little better. These improve further to 93% if we consider the top 5 results (right card). Taking the top 5 results would make sense if we expected several relevant documents, say 2–3 for every question; but in this evaluation, we are expecting an exact answer from a single document. In this case, sending a context of 5 documents, 4 of which are not relevant, is going to confuse the LLM trying to generate an answer to the user’s question. It has been shown repeatedly in research and practical implementations that LLMs get confused by a lot of irrelevant context, irrespective of the context window maximum.
As a next step, let’s try OpenAI’s text-embedding-3-large, their most capable embedding model at the time of writing.
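The setup mirrors the previous example; only the embedding function changes. A rough sketch (assuming an OPENAI_API_KEY variable is already set, and reusing the create_vector_collection helper from the notebook) could look like this:

from chromadb.utils import embedding_functions

# Swap in OpenAI's text-embedding-3-large via ChromaDB's OpenAI embedding function
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=OPENAI_API_KEY,
    model_name="text-embedding-3-large",
)
embedding_model_name = "text-embedding-3-large"
model_under_test, collection = create_vector_collection(embedding_model_name, openai_ef)

From there, model_under_test.run_test is called exactly as in the all-MiniLM-L6-v2 example.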
The results of the evaluation for the text-embedding-3-large model.
This is a major improvement overall. Comparing the top 3 to the top 5 results, Accuracy and Recall don’t improve much, which means most of the relevant documents are returned in the top 3 or not at all. Passing a context of 3 documents (selecting k = 3) gives us a much tighter context window for the LLM in the answer generation step.
At this point, it’s worth mentioning latency and cost. all-MiniLM-L6-v2 runs locally and consumes only about 90 MB of memory. You still need to dedicate compute resources to it and size the infrastructure in proportion to how much data you plan to ingest into the vector DB and your retrieval throughput. This is something you have to operate and maintain yourself, but it has a relatively small footprint. text-embedding-3-large from OpenAI sits behind an API, which makes it worry-free on the maintenance side. Being a much larger model behind an API also has cost (billed per token) and latency implications: every document or question that gets embedded requires a round trip to the API. For reference, on my average Mac machine, embedding 30 documents and 100+ questions took about twice as long with text-embedding-3-large (~52 sec) as with all-MiniLM-L6-v2 (~25 sec).
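If you want to sanity-check latency on your own hardware, one quick (and admittedly rough) approach is to time the two embedding functions directly. The document list below is a hypothetical placeholder, and OPENAI_API_KEY is assumed to be set:

import time

from chromadb.utils import embedding_functions

docs = ["example document text"] * 30  # Placeholder documents; substitute your own corpus

# Time the local SentenceTransformer model
local_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
start = time.perf_counter()
local_ef(docs)
print(f"all-MiniLM-L6-v2: {time.perf_counter() - start:.1f}s for {len(docs)} documents")

# Time the OpenAI API model
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=OPENAI_API_KEY, model_name="text-embedding-3-large"
)
start = time.perf_counter()
openai_ef(docs)
print(f"text-embedding-3-large: {time.perf_counter() - start:.1f}s for {len(docs)} documents")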
Is there a compromise? We will try the gte-small model from Alibaba DAMO Academy. It’s fairly compact (120 MB of memory) and something you can run locally.
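gte-small is published on Hugging Face as thenlper/gte-small and works with sentence-transformers, so one way to plug it into the same ChromaDB setup is through the same SentenceTransformer embedding function. This is a sketch under those assumptions, not necessarily how the accompanying notebook wires it up:

from chromadb.utils import embedding_functions

# Load gte-small through ChromaDB's SentenceTransformer wrapper
embedding_model_name = "gte-small"
gte_small_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="thenlper/gte-small"
)
model_under_test, collection = create_vector_collection(embedding_model_name, gte_small_ef)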
The results of the evaluation for the gte-small model.
Performance when considering the top 3 results is close to that of text-embedding-3-large, and it is actually better in Accuracy and Recall when looking at the top 5 results. MRR is slightly worse, as text-embedding-3-large tends to return relevant results closer to the top. Being a much smaller model, gte-small is on par with all-MiniLM-L6-v2 in vector size (384 embedding dimensions) and latency. The smaller vector size (384, as opposed to 3,072 for text-embedding-3-large) also means faster ingestion of new documents and faster queries against the vector DB.
Here is a summary of all three models with k = 3:
The best embedding model for RAG is…
There is not going to be one best model for every RAG. However, you now have the key decision criteria you can use to determine the best embedding model for your use case.
Aside from the three models we described (which you may want to evaluate), there are many more models to consider. The MTEB List is a good starting point. The key point to remember is that you always want to evaluate them with your own data.
Ready to run your own custom evaluation? Sign up for Okareo if you haven’t already, and then follow the steps in this article or in the Compare Embedding Models documentation page.