Optimizing Your RAG - Choose an Embedding Model That Fits Your Data

RAG

Boris Selitser, Co-founder of Okareo

August 12, 2024

Just like with any development project, when building a RAG one size does not fit all. The size and type of data you plan to retrieve from your RAG will sway many engineering decisions. One of the first and most important decisions is which embedding model to use. You need to know how well the embedding model finds similarity between your RAG documents and typical user questions.

  • So how do you decide which embedding model is best for your data?

  • And what does ‘best’ even mean in the RAG context?

These questions are what we will answer next.

Note - You can run all the examples in this blog via these two notebooks:

Even more fun would be to plug in your own RAG data and see which embedding model does better. The primary tools we'll be using for this are Okareo and ChromaDB.

Generate Your RAG Questions

We could choose an embedding model based on vibes, or check out the Massive Text Embedding Benchmark (MTEB) Leaderboard. But again, one size does not fit all use cases, and we want this to be a data-driven decision. One of the most straightforward and popular uses of RAG is some form of chatbot question answering. To capture this, we want a meaningful number of questions that represent our typical users and the data they will be accessing. Where do we get this ‘meaningful number’ of questions? We synthetically generate them. A good starting point is 100+ questions, and the Okareo SDKs (Python and TypeScript) have a quick way of giving us that starting point.

Here is the key bit of Python code that generates the RAG questions:

import random
import string

from okareo import Okareo
# The Okareo SDK types below are assumed to come from the generated client models module
from okareo_api_client.models import GenerationTone, ScenarioSetGenerate, ScenarioType

okareo = Okareo(OKAREO_API_KEY)  # OKAREO_API_KEY is defined earlier in the notebook
random_string = ''.join(random.choices(string.ascii_letters, k=5))

# Use the scenario set of documents to generate a scenario of questions
generated_scenario = okareo.generate_scenario_set(
  ScenarioSetGenerate(
    name=f"Retrieval - Generated Scenario - {random_string}",
    source_scenario_id=document_scenario.scenario_id,
    number_examples=4,  # Number of questions to generate for each document
    generation_type=ScenarioType.TEXT_REVERSE_QUESTION,  # Generate questions from the document text
    generation_tone=GenerationTone.INFORMAL,  # Tone of the generated questions
    post_template="""{"question": "{generation.input}", "document": "{input}"}""",  # Keep each question next to its source document for easy validation
  )
)

# Print a link back to Okareo app to see the generated scenario 
print(f"See generated scenario in Okareo app: {generated_scenario.app_link}")

Note - Full notebook for Generating Your RAG Questions is available here

Measuring What Matters - Metrics

Now that we have the questions for evaluation, it’s time to see whether the RAG vector database returns the relevant documents for each question. As mentioned, the quality of results from the database is principally determined by the embedding model. How do we measure how often, and in what order, the relevant documents are returned across 100+ questions? If you have been playing with vector DBs, you also know about the k value: basically, how many of the top-scoring documents (scored via the embedding function) the vector DB should return. This can get a bit hairy, and sometimes even daunting. Let’s focus on a few simple metrics that get the job done most of the time.

  • Accuracy - For each question, do the retrieved documents contain at least one relevant document?

  • Recall - For each question, do the retrieved documents contain all the relevant documents?

  • MRR (Mean Reciprocal Rank) - How early in the results does the first relevant document appear? The top result scores the highest and the last result scores the lowest.

The metrics above tell us whether the vector DB and the embedding model together are giving us relevant results. But just like with any engineering decision, overall system performance also has to account for latency and cost, among other factors, including the cost to maintain and operate the implementation on some infrastructure. We’ll try to bring all these factors together in the next section.
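
To make these definitions concrete, here is a minimal, purely illustrative sketch of how Accuracy@k, Recall@k, and MRR could be computed from ranked retrieval results (Okareo computes these for you). Recall is computed here as the fraction of relevant documents retrieved, which reduces to the definition above when each question has a single relevant document, as in this evaluation.

from typing import Dict, List, Set

def retrieval_metrics(
    retrieved: Dict[str, List[str]],  # question id -> ranked list of retrieved document ids
    relevant: Dict[str, Set[str]],    # question id -> set of relevant document ids
    k: int = 3,
) -> Dict[str, float]:
    accuracy_sum, recall_sum, rr_sum = 0.0, 0.0, 0.0
    for qid, ranked in retrieved.items():
        top_k = ranked[:k]
        rel = relevant[qid]
        # Accuracy@k: is at least one relevant document in the top k?
        accuracy_sum += float(any(doc in rel for doc in top_k))
        # Recall@k: what fraction of the relevant documents made it into the top k?
        recall_sum += len(rel & set(top_k)) / len(rel)
        # Reciprocal rank: 1 / rank of the first relevant document (0 if none was retrieved)
        rr_sum += next((1.0 / (rank + 1) for rank, doc in enumerate(ranked) if doc in rel), 0.0)
    n = len(retrieved)
    return {"accuracy@k": accuracy_sum / n, "recall@k": recall_sum / n, "mrr": rr_sum / n}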

Let’s Kick Off This Model Showdown!

Note - The full embedding model comparison is laid out in this notebook: Model Showdown

We’ll be using ChromaDB to compare the different embedding models in the rest of this blog. It’s quick to set up, and the choice of vector DB has little bearing on embedding model selection.

We will start with all-MiniLM-L6-v2. It's a lightweight, general-purpose model from Sentence Transformers (free!), and it’s also the default embedding model for ChromaDB.

The interesting portion of the code for running the 100+ questions we created earlier against this model is:

from chromadb.utils import embedding_functions

embedding_model_name = "all-MiniLM-L6-v2"  # The default SentenceTransformer model that ChromaDB uses to embed documents
default_sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=embedding_model_name)
model_under_test, collection = create_vector_collection(embedding_model_name, default_sentence_transformer_ef)

# Perform a test run using the uploaded scenario  
test_run_item = model_under_test.run_test(
  scenario=scenario, # use the scenario uploaded earlier in this notebook
  name=f"RAG Comparison {embedding_model_name} - {random_string}",
  test_run_type=TestRunType.INFORMATION_RETRIEVAL, # specify that we are running a retrieval test
)

# Print a link back to Okareo app for evaluation visualization 
print(f"See results in Okareo app for embedding model {embedding_model_name}: {test_run_item.app_link}")

After running the above retrieval test, here are the results you will see in the Okareo app:

The leftmost card tells us that the chances of the first result being relevant are about 66% … not great. If we consider the top 3 results (middle card), Accuracy and Recall improve to ~85% (better, but still not great), and to as much as 93% if we consider the top 5 results. Taking the top 5 results would make sense if we expected several relevant documents, say 2-3 for every question. But in this evaluation we are expecting an exact answer from a single document. In this case, sending a context of 5 documents, 4 of which are not relevant, is going to confuse the LLM trying to generate an answer to the user question. This is fairly intuitive, and it has been shown repeatedly in research and practical implementations that LLMs get confused by large amounts of irrelevant context, irrespective of the maximum context window.

It would not be surprising if one of the embedding models from OpenAI did better. Let’s try text-embedding-3-large, the most powerful OpenAI embedding model at the time of writing.
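
Swapping models is just a matter of handing ChromaDB a different embedding function. A sketch of how text-embedding-3-large could be plugged in via ChromaDB's built-in OpenAI wrapper, assuming an OPENAI_API_KEY environment variable and reusing the notebook's create_vector_collection helper:

import os

from chromadb.utils import embedding_functions

embedding_model_name = "text-embedding-3-large"
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],  # embeddings are computed via the OpenAI API
    model_name=embedding_model_name,
)
model_under_test, collection = create_vector_collection(embedding_model_name, openai_ef)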

This is a major improvement overall. Looking at the difference between the top 3 and top 5 results, Accuracy and Recall don’t improve much, which means most of the relevant documents are returned in the top 3 or not at all. Passing a context of 3 documents (selecting k = 3) gives us a much tighter context window for the LLM in the answer generation step.

Ok, time to talk about latency and cost. all-MiniLM-L6-v2 runs locally and consumes only about 90 MB of memory. You still need to dedicate compute resources to it and size infrastructure proportionally to the data volume you plan to ingest into the vector DB/store and your retrieval throughput. It’s something you need to operate and maintain yourself, but it has a relatively small footprint. text-embedding-3-large from OpenAI sits behind an API, which makes it worry-free from the maintenance side. Being a much larger model and sitting behind an API also brings cost (priced per token) and latency implications: every document or question embedding requires an API round trip. For reference, on my average Mac, embedding 30 documents and 100+ questions took about twice as long with text-embedding-3-large (~52 sec) vs. all-MiniLM-L6-v2 (~25 sec).
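
Your numbers will vary with hardware and network, but a rough way to reproduce the comparison is to time the embedding calls directly. A minimal sketch, using a placeholder list in place of the 30 documents and 100+ questions:

import os
import time

from openai import OpenAI
from sentence_transformers import SentenceTransformer

texts = ["What is the refund policy for enterprise plans?"] * 130  # stand-in for documents + questions

start = time.perf_counter()
SentenceTransformer("all-MiniLM-L6-v2").encode(texts)  # local model; timing includes one-time model load
print(f"all-MiniLM-L6-v2: {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
OpenAI(api_key=os.environ["OPENAI_API_KEY"]).embeddings.create(
    model="text-embedding-3-large", input=texts  # embeddings computed remotely over the API
)
print(f"text-embedding-3-large: {time.perf_counter() - start:.1f}s")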

Is there a compromise? We will try the gte-small model from Alibaba DAMO Academy. It's fairly compact (120 MB of memory) and something you can run locally.
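
gte-small runs through the same SentenceTransformer embedding function; a sketch, assuming the model is published on Hugging Face as thenlper/gte-small and again reusing the notebook's create_vector_collection helper:

from chromadb.utils import embedding_functions

embedding_model_name = "gte-small"
gte_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="thenlper/gte-small")
model_under_test, collection = create_vector_collection(embedding_model_name, gte_ef)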

Performance when considering the top 3 results is close to that of text-embedding-3-large, and it is actually better in Accuracy and Recall when looking at the top 5 results. MRR is slightly worse, as text-embedding-3-large tends to return relevant results closer to the top. Being a much smaller model overall, gte-small is on par with all-MiniLM-L6-v2 in vector size (384 embedding dimensions) and latency. The smaller vector size (384 vs. 3072 for text-embedding-3-large) also means it will be faster at ingesting new documents and executing queries against the vector DB.
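
If you want to verify the vector sizes yourself, SentenceTransformer models report their output dimension directly (the 3072 figure for text-embedding-3-large comes from OpenAI's documentation):

from sentence_transformers import SentenceTransformer

for name in ["all-MiniLM-L6-v2", "thenlper/gte-small"]:
    model = SentenceTransformer(name)
    print(name, model.get_sentence_embedding_dimension())  # both report 384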

There is no perfect answer, but the key decision criteria are there to weigh against your specific RAG requirements and use case. There are many more models to consider, and the MTEB Leaderboard is a good starting point. The key point to remember is that you always want to evaluate with your own data.

Here is a summary of all three models with k = 3:
