Optimizing Your RAG - Practical Guide for Software Engineers


Boris Selitser, Co-founder of Okareo

August 21, 2024

  • RAG isn't just about "slapping a Vector DB on an LLM"; building a real app involves many technical decisions.

  • Something No One Tells You: Your RAG performance metrics are only meaningful if you evaluate them using your own data.

  • Besides overall RAG performance, software engineers must evaluate the performance of each RAG phase (e.g., Retrieval, Generation) individually due to its inherent complexity.

  • Poor performance in an upstream RAG phase has a cascading effect on the overall performance.

Principles of RAG - Fundamentally, RAG is an architecture for building apps with foundational LLMs. Similar to the way you often separate frontend and backend code (though not always, as with a Node.js monolith), the key principle here is separating the LLM's emerging reasoning abilities—such as decision-making, planning, and problem detection—from the factual data it was trained on. Training data is inherently outdated, often produces contradictory or fabricated outputs (e.g., hallucinations), and does not include any of your private data. You want to maintain tight control over the data the LLM operates on while reusing the best part of the LLM: its reasoning ability. Easier said than done…

RAG Use Cases: Since RAG is a general architecture, there are many ways it gets applied. It's not about choosing RAG over Fine-Tuning or RAG over Agents - it’s an “and.” RAG is a general method for getting your data into an LLM.

  • Question Answering - This is the most obvious and best-publicized use case. It could be a feature in a larger FAQ, user-help, or knowledge app. RAG allows synthesizing answers that require multiple sources of information and opens up domain-specific areas.

  • Chatbot/Co-Pilot - Previous generations of chatbots were often rule-based and scripted. Adding RAG enables conversational interaction with your domain-specific knowledge, including the nuances of dialog context and history.

  • Enterprise Search - Traditionally this was done via keyword-indexing approaches (e.g., Elasticsearch). RAG brings more relevant results through semantic retrieval and the ability to answer more complex queries by correlating several data sources.

  • Agent Knowledge - In an agentic system, individual agents perform tasks based on knowledge of the external environment. RAG could be employed to book airline tickets based on the current state of a reservations database, or to retrieve the latest policy documents when deciding how to handle a particular request.

  • RAG from LinkedIn - This is a good example of a production RAG that many of us can see in action.

  • Many more …

What Are The Moving Parts?

Here's a high-level architecture I created for a typical RAG system. Of course, there’s no such thing as a "typical RAG"—it can be much more complex and tailored to specific use cases. Nonetheless, this diagram serves a purpose: the core phases and concepts shown below are what we consistently see across many production deployments.

A fair introduction to RAG is learning how to connect two new tools available to developers: vector DBs and LLMs. This is a solid starting point for building a proof of concept and an initial mental model. After that initial “aha” moment, questions start to surface: How do you leverage RAG with the rest of your app, microservices, or data? Chances are that the rest of your data is not in a single vector DB and may never need to be. Considering the range of use cases above, how do you determine which data sources are relevant to a given request, and how do you properly generate a result?

Answering these questions eventually leads to an architecture that, in one form or another, includes the three core phases of RAG:

  1. Intent Detection + Routing - Identifying which data sources or microservices are needed to fulfill an incoming request—this is your RAG entry point.

  2. Retrieval - Finding relevant data from one or more of the identified data sources.

  3. Generation - Synthesizing the initial intent or goal (autonomous systems often have requests without user input) into a result using retrieved data.
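
The three phases above can be sketched as a minimal pipeline. Everything here is a stub for illustration—the function bodies, the `docs_store` source name, and the string-based "generation" are all placeholders standing in for a real classifier, retriever, and LLM call.

```python
from dataclasses import dataclass

@dataclass
class RAGResult:
    answer: str
    sources: list

def detect_intent_and_route(query: str) -> list:
    """Phase 1: classify the query and pick data sources (stubbed)."""
    # A small classifier would normally run here; we route everything
    # to a single hypothetical document store for illustration.
    return ["docs_store"]

def retrieve(query: str, sources: list) -> list:
    """Phase 2: fetch candidate passages from the chosen sources (stubbed)."""
    return [f"passage about '{query}' from {s}" for s in sources]

def generate(query: str, passages: list) -> RAGResult:
    """Phase 3: synthesize an answer grounded in the retrieved passages (stubbed)."""
    context = " | ".join(passages)
    return RAGResult(answer=f"Answer to '{query}' using: {context}", sources=passages)

def rag_pipeline(query: str) -> RAGResult:
    sources = detect_intent_and_route(query)
    passages = retrieve(query, sources)
    return generate(query, passages)
```

The point of the separation is that each phase can be evaluated, swapped, and fine-tuned independently—which is exactly why per-phase evaluation matters.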

Double-clicking on each phase reveals more layers that need to be peeled away. Each phase requires composing models specifically for that purpose.

Something No One Tells You: As you assemble and fine-tune your RAG stack, how do you evaluate its performance? How do you evaluate the quality/cost/performance trade-offs? The key dimension to evaluate against is data, in both volume and diversity. But not just any data: your RAG performance metrics are only meaningful if you evaluate them using your own data, use cases, and scenarios.

With this context, we’ll dive into each RAG phase next.

Intent Detection + Routing

Intent Detection

  • Identifying the Purpose: RAG input can come from either a user or the system itself, such as in agent-to-agent communication. The first step is understanding the intent or goal behind the input. For ambiguous queries, user clarification is necessary to ensure correct interpretation of the request and overall answer integrity.

  • Off-Topic, Out of Scope, or Malicious Queries: Before routing, it's crucial to filter out irrelevant or harmful queries. This serves the gatekeeper role for security and input validation of the RAG system.

  • Small Model for Intent Detection: Often, a smaller, more efficient classification model is used for intent detection to minimize latency and operating costs while ensuring accurate routing. Custom routing logic can further refine how the system directs queries to the appropriate data sources.
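
A minimal sketch of the gatekeeper-plus-classifier idea, assuming made-up intent labels and keyword lists. A production system would replace the keyword rules with a small fine-tuned classification model; the rejection gate and the `out_of_scope` fallback are the structural points.

```python
# Illustrative intent labels and keyword sets; not from any real taxonomy.
INTENT_KEYWORDS = {
    "billing": {"invoice", "refund", "charge"},
    "product_help": {"install", "configure", "error"},
}
# Crude malicious-input check; real systems use dedicated guardrail models.
BLOCKLIST = {"ignore previous instructions"}

def classify_intent(query: str) -> str:
    q = query.lower()
    # Gatekeeper step: reject harmful or adversarial input before routing.
    if any(bad in q for bad in BLOCKLIST):
        return "rejected"
    tokens = set(q.split())
    for intent, keywords in INTENT_KEYWORDS.items():
        if tokens & keywords:
            return intent
    # Anything unrecognized is flagged rather than guessed at.
    return "out_of_scope"
```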

Routing

  • Choosing the Right Data Source: Once the intent is clear, the next step is deciding which data source to use. While much attention is on vector databases, your data might be in different relational databases, data lakes, graph databases, or a combination of these, depending on the specific use case. In some cases, you want to route the request to an API or perform a web search. This decision is key because the quality of the response heavily depends on the relevance and accuracy of the chosen data source.

  • Latency/Cost/Performance Trade-offs: For each stage, decisions need to be made regarding which models to use—whether a more powerful (but higher-latency) model or deterministic code. Depending on your routing needs, you can start with custom logic and grow from there.
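
Starting with deterministic routing logic can be as simple as a lookup table. The backends and targets below are hypothetical; the design point is that each intent maps to whichever store or API actually holds the relevant data, with a cheap fallback for anything unrecognized.

```python
# Hypothetical routing table: intent -> backend. A vector DB is just one of
# several possible targets alongside SQL databases and external APIs.
ROUTES = {
    "billing": {"backend": "sql", "target": "invoices_db"},
    "product_help": {"backend": "vector", "target": "docs_index"},
    "status_check": {"backend": "api", "target": "https://status.example.com"},
}

def route(intent: str) -> dict:
    # Unrecognized intents fall back to cheap deterministic handling
    # instead of an expensive (and possibly wrong) retrieval pass.
    return ROUTES.get(intent, {"backend": "fallback", "target": "canned_response"})
```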

Query Extraction, Decomposition, Rewriting

  • Extracting Specific Parameters: Based on the initial input (user query or system request), this stage extracts specific metadata and parameters that are crucial for the retrieval phase. For example, these could be date ranges, or product names if data sources are split by product.

  • Query Decomposition and Rewriting: This could involve breaking up or rewriting the query to produce candidates more likely to get relevant results in retrieval.
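
The extraction and decomposition steps can be sketched with naive regex rules—illustrative only; real systems often use an LLM or a trained extractor for this, and the patterns below (years as a stand-in for date ranges, splitting on "and") are assumptions.

```python
import re

def extract_params(query: str) -> dict:
    """Pull out metadata usable as retrieval filters (illustrative pattern)."""
    params = {}
    # Years as a simple stand-in for date-range extraction.
    years = re.findall(r"\b(20\d{2})\b", query)
    if years:
        params["years"] = years
    return params

def decompose(query: str) -> list:
    """Naively split a compound question into retrieval candidates."""
    parts = [p.strip() for p in re.split(r"\band\b|,", query) if p.strip()]
    return parts or [query]
```

Each decomposed part can be retrieved for separately, which tends to surface relevant passages a single compound query would miss.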

How Okareo Can Help

  • Classification: Okareo provides tools that evaluate the classification of user intents, ensuring that the system correctly identifies and categorizes requests. Learn more about Okareo's classification evals.

  • Using Synthetic Data to Improve Test Coverage and Performance: To enhance the accuracy and reliability of intent detection, Okareo offers synthetic data generation. This helps improve test coverage and overall system performance. Read more on how synthetic data can help.

  • Intent Detection Fine-Tuning: Okareo also supports fine-tuning models for intent detection, allowing you to further optimize understanding and routing of specific types of queries. Explore fine-tuning options.

Retrieval

It's important to understand that while vector databases are powerful, they are just one tool among many in the retrieval process. We'll use vector-based retrieval as the example in this section, but other methods deserve equal consideration. Because two-stage retrieval has been around for some time and is well covered elsewhere, we'll only provide a brief overview.

1st Stage - Retrieve: In this stage, a broad set of potentially relevant data points is retrieved based on the input query. This is often done using fast, basic vector search to quickly narrow down a large dataset.

  • Baseline on BM25: Despite the hype around vector models, it’s often best to start with a baseline using basic keyword-search algorithms, e.g., BM25.

  • Embedding Model Selection: An embedding model is used to find the best-matching results for the query based on vector embeddings. The embedding model is the key choice to evaluate in terms of latency, cost, and retrieval performance. Many smaller models are quite powerful and can be ‘good enough’; compare their performance gains against your BM25 baseline.

  • Hybrid Search: A hybrid setup often works best, combining modern sparse (e.g., SPLADE) and dense vector models. In many cases, it delivers better results for general use cases than either alone.
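
To make the BM25 baseline and hybrid ideas concrete, here is a self-contained sketch: a from-scratch implementation of the standard BM25 scoring formula, plus reciprocal rank fusion (RRF), one common way to combine a keyword ranking with a dense-vector ranking. In practice you'd use a library for BM25; the `k1`, `b`, and `k` defaults below are conventional values, not tuned for any particular corpus.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with BM25."""
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N
    # Document frequency: in how many documents each term appears.
    df = Counter(t for d in corpus_tokens for t in set(d))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids (e.g., BM25 + dense) via RRF."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] += 1.0 / (k + rank + 1)
    return [doc for doc, _ in fused.most_common()]
```

RRF is attractive for hybrid search because it needs only ranks, not scores, so the sparse and dense retrievers' incomparable score scales never have to be normalized against each other.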

2nd Stage - ReRanking: After the initial set of results is gathered, a more powerful but slower ReRanker model is applied to reorder the results based on their relevance to the query. ReRanker models are more computationally expensive but usually more accurate than embedding models; hence they are left for the second stage, where they operate on a smaller, narrowed set of results.
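
The two-stage control flow looks like this. The `cross_encoder_score` below is a trivial token-overlap stub standing in for a real ReRanker (typically a cross-encoder model that scores each query-passage pair jointly); only the rerank-the-top-candidates structure is the point.

```python
def cross_encoder_score(query: str, passage: str) -> float:
    """Stub relevance scorer; a real ReRanker model would run here."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def rerank(query: str, candidates: list, top_k: int = 3) -> list:
    """Reorder first-stage candidates with the slower, more accurate scorer."""
    scored = sorted(candidates,
                    key=lambda c: cross_encoder_score(query, c),
                    reverse=True)
    return scored[:top_k]
```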

Optimizing Your Retrieval Stack: Build evaluations for the Retrieve and ReRanking stages using your data and your typical queries. These could come from production or be synthetically generated from your seed inputs. You don’t want some random benchmark or dataset that won't show what relevance means for your app.

To further optimize your RAG, focus on striking the right balance between model footprint and performance for both the Retrieve and ReRank stages. Each model can be further improved for your domain by fine-tuning on synthetic and real data.

How Okareo Can Help

  • Explore Okareo's retrieval evaluation: Due to the intrinsic complexity of retrieval, there is a need for performance evaluation dedicated to this phase. If you can’t isolate and trap problems here, they cascade into downstream failures.

  • Optimizing For Your Data: Okareo provides guidance on selecting an embedding model that fits your data. Learn more about embedding model selection.

  • Synthetic Generation of Evaluation Data

Generation

In the Generation phase, the system takes the relevant data retrieved in the previous steps and uses it to generate the final output. This step is crucial as it synthesizes the input data into a coherent and meaningful response, whether it's answering a query, generating text, or making decisions.

  • Reasoning and Decision-Making Tasks: In this phase, the system undertakes complex reasoning and decision-making tasks. These tasks can complicate performance evaluations, as they require the model to not only generate text but also make logical decisions based on the context provided.

  • Cycle Between Generation Model and Reflective Model: In many implementations, before the result is returned from the Generation Model, it is reviewed for errors by a Reflective Model (or ‘quality model’). The Reflective Model can be a more powerful model or the same model as the Generation Model; it takes advantage of the LLM's reflective property to find errors and inconsistencies when that is the focus of the prompt.
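
The generate-then-reflect cycle can be sketched as a bounded retry loop. Both `call_llm` and `reflect` below are stubs for whatever model endpoints you use (they are not real APIs); the structure—draft, check, revise, with a cap on rounds to bound latency and cost—is what the pattern prescribes.

```python
def call_llm(prompt: str) -> str:
    """Stub for a Generation Model call."""
    return f"draft answer for: {prompt}"

def reflect(draft: str, context: str) -> bool:
    """Stub Reflective Model pass: True if the draft passes the quality check.
    A real check would prompt a model to verify each claim against context."""
    return "draft" in draft

def generate_with_reflection(query: str, context: str, max_rounds: int = 2) -> str:
    draft = call_llm(f"{query}\n\nContext:\n{context}")
    for _ in range(max_rounds):
        if reflect(draft, context):
            return draft
        # Feed detected problems back into another generation round.
        draft = call_llm(f"Revise to fix errors: {draft}")
    # Cap the cycle: return the best effort rather than looping forever.
    return draft
```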

How Okareo Can Help



  • Poor performance in an upstream RAG phase has a cascading effect on the overall performance.

Principles of RAG - Fundamentally RAG is an architecture for building apps with foundational LLMs. Similar to the way you often separate frontend and backend code (though not always, as with a Node.js monolith), the key principle here is separating the LLM's emerging reasoning abilities—such as decision-making, planning, and problem detection—from the factual data it was trained on. Training data is inherently outdated, often contradictory (e.g., hallucinations), and does not include any private data. You want to maintain tight control on the data LLM operates on and reuse the best of LLM, its reasoning ability. Easier said than done…

RAG Use Cases: Since RAG is a general architecture, there are many ways it gets applied. It's not about choosing RAG over Fine-Tuning or RAG over Agents - it’s an “and.” RAG is a general method for getting your data into an LLM.

  • Question Answering - This is the most obvious and best publicized out there. Could be a feature in a larger FAQ, user help or knowledge app. RAG allows synthesizing answers that require multiple sources of information and opens up domain specific areas.

  • Chatbot/Co-Pilot - Prior-gen of chatbots were often rule based and scripted. Adding RAG allows conversational interaction with your domain specific knowledge, including nuances of dialog context and history.

  • Enterprise Search - Traditionally this was done via keyword indexing approaches (e.g Elasticsearch). Now RAG brings more relevant results with semantic retrieval and ability to answer more complex queries by correlating several data sources.

  • Agent Knowledge - In an agentic system, individual agents perform tasks based on knowledge of the external environment. RAG could be employed to book airline tickets based on the current state of a reservations database, or to retrieve the latest policy documents when deciding how to handle a particular request.

  • RAG from LinkedIn - This is a good example of a production RAG that many of us can see in action.

  • Many more …

What Are The Moving Parts?

Here's a high-level architecture I created for a typical RAG system. Of course, there’s no such thing as a "typical RAG": it can be much more complex and tailored to specific use cases. Nonetheless, this diagram serves a purpose, as the core phases and concepts shown below are what we consistently see across many production deployments.

A fair introduction to RAG is how to connect two new tools available to developers: Vector DBs and LLMs. This is a solid starting point for building a proof of concept and an initial mental model. After that initial “aha” moment, questions start to surface: How do you leverage RAG alongside the rest of your app, microservices, or data? Chances are the rest of your data is not in a single Vector DB and may never need to be. Considering the range of use cases above, how do you determine which data sources are relevant to a given request, and how do you properly generate a result?

Answering these questions eventually leads to an architecture that, in one form or another, includes the three core phases of RAG:

  1. Intent Detection + Routing - Identifying which data sources or microservices are needed to fulfill an incoming request—this is your RAG entry point.

  2. Retrieval - Finding relevant data from one or more of the identified data sources.

  3. Generation - Synthesizing the initial intent or goal (autonomous systems often have requests without user input) into a result using retrieved data.
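The three phases above can be sketched as a minimal pipeline. Every function body here is a hypothetical stub (the corpus, source names, and routing rule are invented for illustration); the point is the shape of the flow:

```python
# Minimal sketch of the three RAG phases chained into one pipeline.
# All bodies are toy stand-ins for real models and data stores.

def detect_intent_and_route(query: str) -> str:
    """Phase 1: map the query to a data source (toy keyword rule)."""
    if "book" in query.lower():
        return "reservations_db"
    return "docs_index"

def retrieve(query: str, source: str) -> list[str]:
    """Phase 2: fetch relevant records from the chosen source (stubbed)."""
    corpus = {
        "reservations_db": ["Flight OK123 confirmed for Aug 21."],
        "docs_index": ["Refund policy: tickets refundable within 24h."],
    }
    return corpus[source]

def generate(query: str, context: list[str]) -> str:
    """Phase 3: synthesize an answer from the retrieved context (stubbed)."""
    return f"Answer to '{query}' using: {context[0]}"

def rag(query: str) -> str:
    source = detect_intent_and_route(query)
    docs = retrieve(query, source)
    return generate(query, docs)
```

In a real system each stub becomes a model call or a query against a live data source, but the control flow stays recognizably the same.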

Double-clicking on each phase reveals more layers that need to be peeled away. Each phase requires composing models specifically for that purpose.

Something No One Tells You: As you assemble and fine-tune your RAG stack how do you evaluate its performance? How do you evaluate the Quality / Cost / Performance trade-offs? The key dimension to evaluate against is data, both in volume and diversity. But not just any data. Your RAG performance metrics are only meaningful if you evaluate them using your own data, use cases, and scenarios.

With this context, we’ll dive into each RAG phase next.

Intent Detection + Routing

Intent Detection

  • Identifying the Purpose: RAG input can come from either a user or the system itself, such as in agent-to-agent communication. The first step is understanding the intent or goal behind the input. For ambiguous queries, user clarification is necessary to ensure correct interpretation of the request and overall answer integrity.

  • Off-Topic, Out of Scope, or Malicious Queries: Before routing, it's crucial to filter out irrelevant or harmful queries. This serves the gatekeeper role for security and input validation of the RAG system.

  • Small Model for Intent Detection: Often, a smaller, more efficient classification model is used for intent detection to minimize latency and operating costs while ensuring accurate routing. Custom routing logic can further refine how the system directs queries to the appropriate data sources.
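As a sketch of the gatekeeper and classification roles described above, here is a toy keyword-based classifier. In production this would be a small fine-tuned classification model; the keyword sets and blocklist below are invented for illustration:

```python
# Toy intent classifier illustrating the gatekeeper + routing roles.
# Keyword scoring stands in for a small fine-tuned classification model.

INTENT_KEYWORDS = {
    "question_answering": {"how", "what", "why", "explain"},
    "booking": {"book", "reserve", "cancel"},
}
BLOCKLIST = {"ignore previous instructions", "system prompt"}

def classify_intent(query: str) -> str:
    lowered = query.lower()
    # Gatekeeper: reject malicious or out-of-scope inputs first.
    if any(phrase in lowered for phrase in BLOCKLIST):
        return "rejected"
    tokens = set(lowered.split())
    scores = {intent: len(tokens & kw) for intent, kw in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # No keyword overlap with any intent -> treat as out of scope.
    return best if scores[best] > 0 else "out_of_scope"
```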

Routing

  • Choosing the Right Data Source: Once the intent is clear, the next step is deciding which data source to use. While much attention is on vector databases, your data might be in different relational databases, data lakes, graph databases, or a combination of these, depending on the specific use case. In some cases, you want to route the request to an API or perform a web search. This decision is key because the quality of the response heavily depends on the relevance and accuracy of the chosen data source.

  • Latency/Cost/Performance Trade-offs: For each stage, decisions need to be made about what to use: a more powerful but higher-latency model, or deterministic code. Depending on your routing needs, you can start with custom logic and grow from there.
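One way to sketch this routing decision is a registry mapping intents to handlers, with a deterministic fallback. The handler functions and route names below are hypothetical stubs:

```python
# Hypothetical router mapping a detected intent to a data source or API.
# Deterministic code first; a real system might blend this with model scores.

from typing import Callable

def search_vector_db(query: str) -> str:
    return f"vector hits for: {query}"     # stub for semantic retrieval

def query_orders_api(query: str) -> str:
    return f"order records for: {query}"   # stub for a relational/API source

def web_search(query: str) -> str:
    return f"web results for: {query}"     # stub fallback

ROUTES: dict[str, Callable[[str], str]] = {
    "product_docs": search_vector_db,
    "order_status": query_orders_api,
}

def route(intent: str, query: str) -> str:
    # Unknown intents fall back to web search rather than failing.
    handler = ROUTES.get(intent, web_search)
    return handler(query)
```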

Query Extraction, Decomposition, Rewriting

  • Extracting Specific Parameters: Based on the initial input (a user query or system request), this stage extracts specific metadata and parameters that are crucial for the retrieval phase. For example, this could be date ranges, or product names if data sources are split by product.

  • Query Decomposition and Rewriting: This could involve breaking up or rewriting the query to produce candidates more likely to get relevant results in retrieval.
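A minimal sketch of parameter extraction and decomposition, assuming ISO-style dates and naive splitting on conjunctions (real systems often use an LLM or a trained extractor for both steps):

```python
# Sketch of query decomposition + parameter extraction.
# The regex patterns are illustrative, not production-grade.

import re

def extract_dates(query: str) -> list[str]:
    # Matches ISO-style dates like 2024-08-21.
    return re.findall(r"\d{4}-\d{2}-\d{2}", query)

def decompose(query: str) -> list[str]:
    # Naive split on coordinating conjunctions / semicolons to produce
    # sub-queries more likely to retrieve relevant results individually.
    parts = re.split(r"\band\b|;", query)
    return [p.strip() for p in parts if p.strip()]
```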

How Okareo Can Help

  • Classification: Okareo provides tools that evaluate the classification of user intents, ensuring that the system correctly identifies and categorizes requests. Learn more about Okareo's classification evals.

  • Using Synthetic Data to Improve Test Coverage and Performance: To enhance the accuracy and reliability of intent detection, Okareo offers synthetic data generation. This helps improve test coverage and overall system performance. Read more on how synthetic data can help.

  • Intent Detection Fine-Tuning: Okareo also supports fine-tuning models for intent detection, allowing you to further optimize understanding and routing of specific types of queries. Explore fine-tuning options.

Retrieval

It's important to understand that while vector databases are powerful, they are just one tool among many in the retrieval process. We'll use vector-based retrieval as the example in this section, but other methods deserve equal consideration. Because two-stage retrieval has been around for some time and is well covered elsewhere, we'll only provide a brief overview.

1st Stage - Retrieve: In this stage, a broad set of potentially relevant data points is retrieved based on the input query. This is often done using fast, basic vector search to quickly narrow down a large dataset.

  • Baseline on BM25: Despite the hype around vector models, it’s often best to start with a baseline using a basic keyword search algorithm, e.g., BM25.

  • Embedding Model Selection: An embedding model finds the best-matching results for the query based on vector embeddings. The embedding model is the key choice to evaluate in terms of latency, cost, and retrieval performance. Many smaller models are quite powerful and can be ‘good enough’. Compare performance gains against your BM25 baseline.

  • Hybrid Search: A hybrid setup, combining modern sparse (e.g., SPLADE) and dense vector models, often delivers the best results for general use cases.
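To make the BM25 baseline concrete, here is a minimal pure-Python scorer using the standard formula with default k1/b parameters. This is a sketch for intuition; for real workloads you would use a library or search engine implementation instead:

```python
# Minimal BM25 scorer to establish a keyword-search baseline before
# reaching for embedding models. Whitespace tokenization is deliberately
# naive; production systems stem/normalize terms.

import math
from collections import Counter

def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    n = len(tokenized)
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in tokenized if term in d)  # document frequency
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores
```

Scoring a handful of your own documents this way gives the baseline that embedding and hybrid setups should be measured against.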

2nd Stage - ReRanking: After the initial set of results is gathered, a more powerful but slower ReRanker model reorders them by relevance to the query. ReRanker models are more computationally expensive but usually more accurate than embedding models, so they are left for the second stage, where they operate on a smaller, narrowed set of results.
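The two stages can be sketched as follows. Both stage functions are stand-ins: `first_stage` approximates fast candidate retrieval, and `rerank_score` stands in for a cross-encoder ReRanker call:

```python
# Two-stage retrieval sketch: a fast first stage narrows the corpus,
# then a slower ReRanker reorders the short list. Term overlap stands
# in for real vector search and cross-encoder scoring.

def first_stage(query: str, corpus: list[str], k: int = 50) -> list[str]:
    # Stand-in for fast vector/keyword search: keep docs sharing any term.
    terms = set(query.lower().split())
    hits = [d for d in corpus if terms & set(d.lower().split())]
    return hits[:k]

def rerank_score(query: str, doc: str) -> float:
    # Hypothetical relevance score; a real ReRanker scores the
    # (query, doc) pair jointly with a cross-encoder model.
    terms = set(query.lower().split())
    return len(terms & set(doc.lower().split())) / len(terms)

def retrieve_two_stage(query: str, corpus: list[str], top_n: int = 3) -> list[str]:
    candidates = first_stage(query, corpus)
    ranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    return ranked[:top_n]
```

The structural point is that the expensive scorer only ever sees the narrowed candidate list, which is what keeps the second stage affordable.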

Optimizing Your Retrieval Stack: Build evaluations for Retrieve and ReRanking stages using your data and your typical queries. These could come from production or be synthetically generated based on your seed inputs. You don’t want some random benchmark or dataset that will not show what relevance means for your app.

To further optimize your RAG, focus on striking the right balance between model footprint and performance for both the Retrieve and ReRank stages. Each model can be further improved for your domain by fine-tuning on synthetic and real data.

How Okareo Can Help

  • Explore Okareo's retrieval evaluation: Due to intrinsic complexity of retrieval, there is a need for performance evaluation dedicated to this phase. If you can’t isolate and trap problems here, they cascade into downstream failures.

  • Optimizing For Your Data: Okareo provides guidance on selecting an embedding model that fits your data. Learn more about embedding model selection.

  • Synthetic Generation of Evaluation Data

Generation

In the Generation phase, the system takes the relevant data retrieved in the previous steps and uses it to generate the final output. This step is crucial as it synthesizes the input data into a coherent and meaningful response, whether it's answering a query, generating text, or making decisions.

  • Reasoning and Decision-Making Tasks: In this phase, the system undertakes complex reasoning and decision-making tasks. These tasks can complicate performance evaluations, as they require the model to not only generate text but also make logical decisions based on the context provided.

  • Cycle Between Generation Model and Reflective Model: In many implementations, before the result is returned from the Generation Model, it is reviewed for errors by a Reflective Model, or ‘quality model’. The Reflective Model can be a more powerful model or the same model as the Generation Model; it takes advantage of an LLM's reflective property to find errors and inconsistencies when the prompt focuses on that task.
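The cycle can be sketched as a draft-critique-revise loop. Both `generate` and `reflect` below are hypothetical stand-ins for LLM calls; the loop structure (draft, critique, revise, stop) is the point:

```python
# Sketch of the Generation <-> Reflective Model cycle. A bounded loop
# keeps latency predictable; an empty critique means the draft passes.

def generate(query: str, context: str, feedback: str = "") -> str:
    # Stand-in for the Generation Model call.
    suffix = " (revised)" if feedback else ""
    return f"Draft answer for '{query}' from context{suffix}"

def reflect(answer: str, context: str) -> str:
    # Stand-in for the Reflective Model: return "" when no issues found.
    return "" if "(revised)" in answer else "cite the context explicitly"

def answer_with_reflection(query: str, context: str, max_rounds: int = 2) -> str:
    answer = generate(query, context)
    for _ in range(max_rounds):
        feedback = reflect(answer, context)
        if not feedback:          # Reflective Model found no errors.
            break
        answer = generate(query, context, feedback)
    return answer
```

Capping the number of rounds matters in practice, since each reflection pass adds a full model call to the request latency.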

How Okareo Can Help

