Understanding RAG architecture and its key components

RAG

Boris Selitser, Co-founder of Okareo

August 21, 2024

RAG is fast becoming an industry standard, especially when users need real-time access to large amounts of dynamic data. Today, many of the largest AI chatbots (including ChatGPT and Bard) use some variation of RAG architecture, and developers of LLM-powered apps are also choosing to use it.

But RAG isn't just about "slapping a Vector DB on an LLM"; integrating a RAG system into a real-world app involves many technical decisions. Here, we define RAG architecture — explaining how the key components work and fit together.

What is RAG and when to use it?

Fundamentally, RAG is an architecture for improving the performance of foundation LLMs, and it's worth using if you're a developer of LLM-powered apps.

RAG augments what a generative LLM can do by including a retrieval mechanism that fetches relevant data from external, specialized sources, which is then fed into the LLM (along with the original query) for added context. This enhances the accuracy and contextual relevance of responses from generative models and reduces hallucinations. It also opens the results to external sources, including real-time data that is unavailable in the LLM's training data, which usually has a cut-off date.
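To illustrate the mechanics, here's a minimal sketch of that flow in Python. Both `retrieve` and `llm_complete` are hypothetical stand-ins for whatever retriever and model client your app actually uses:

```python
from typing import List

def retrieve(query: str, top_k: int = 5) -> List[str]:
    # Hypothetical stand-in for a real retriever (vector DB, keyword search, API, ...).
    return ["RAG pairs a retrieval step with a generative model."][:top_k]

def llm_complete(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM client call.
    return f"(model response grounded in the prompt: {prompt[:40]}...)"

def build_augmented_prompt(query: str, documents: List[str]) -> str:
    """Feed retrieved context to the LLM along with the original query."""
    context = "\n\n".join(documents)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

answer = llm_complete(build_augmented_prompt("What is RAG?", retrieve("What is RAG?")))
```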

A key principle of RAG is that it separates the LLM's emerging reasoning abilities (such as decision-making, planning, and problem detection) from the factual data it was trained on.

Since RAG is a general architecture, there are many ways to apply it, which explains its popularity. Some common use cases are:

  • Question Answering - This is the most obvious and best-publicized use case for RAG. This could be a feature in a larger FAQ, user help base, or knowledge app. RAG allows multiple sources of information to be used to synthesize answers, including domain-specific data (e.g. proprietary information about clinical trials, or engineering documentation about a specific product) that might not have been part of the LLM's more general original training data.

  • Chatbot/Co-Pilot - Previous generations of chatbots were often rule-based and scripted. Adding RAG allows conversational interaction with your domain-specific knowledge, including nuances of dialog context and history.

  • Enterprise Search - Traditionally this was done via keyword indexing approaches (e.g. Elasticsearch). Now RAG brings more relevant results with semantic retrieval and the ability to answer more complex queries by correlating several data sources.

  • Agent Knowledge - In agentic systems, individual agents perform tasks based on knowledge of their external environment. An agent responsible for booking airline tickets can now use RAG to pull data from external knowledge bases, such as the current reservations database or the latest policy documents store, and use this information to decide how to handle a particular request.

For a detailed real-world example of a RAG system in use, read about how LinkedIn built a multi-agent RAG system that retrieves and synthesizes information from various sources, including real-time data from the LinkedIn and Bing APIs.

What is the architecture of a RAG system?

Here's a high-level architecture I created for a typical RAG system. Of course, there’s no such thing as a "typical RAG"—it can be much more complex and tailored to specific use cases. Nonetheless, this diagram serves a purpose, as the core phases and concepts shown below are what we consistently see across many production deployments.

If you're using a RAG architecture, it's likely that some of your data will be in a vector DB, and you're considering how to connect it to an LLM. But you likely also have other data relating to your app that's not in a vector database, and it may never need to be. This raises the question: how do you leverage RAG with the rest of your app, microservices, or data? And considering the range of use cases above, how do you determine which data sources are relevant to a given request?

Questions like this are why intent detection and routing has now become a key component of many RAG systems. The diagram above shows that there are three core phases of RAG:

  1. Intent Detection + Routing - Identifying which data sources or microservices are needed to fulfill an incoming request — this is your RAG entry point.

  2. Retrieval - Finding relevant data from one or more of the identified data sources.

  3. Generation - Synthesizing the initial intent or goal (autonomous systems often have requests without user input) into a result using retrieved data.

Each of these three phases is backed by a model that you need to compose specifically for that purpose, and problems with any of these phases will translate into downstream failures. That's why it's important to evaluate each phase of your RAG separately.
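To make the composition concrete, here's a minimal sketch in Python; every function and source name below is a hypothetical stand-in for the phase-specific components described in the following sections:

```python
# Hypothetical stand-ins for the per-phase components described below.
def detect_intent(request: str) -> str:
    return "product_faq" if "how" in request.lower() else "general"

SOURCES = {"product_faq": ["product_docs_index"], "general": ["general_index"]}

def retrieve_from(source: str, request: str) -> list[str]:
    return [f"chunk from {source} relevant to: {request}"]

def generate_answer(request: str, context: list[str]) -> str:
    return f"answer to {request!r} grounded in {len(context)} retrieved chunks"

def rag_pipeline(request: str) -> str:
    intent = detect_intent(request)                    # Phase 1: intent detection
    sources = SOURCES.get(intent, SOURCES["general"])  # Phase 1: routing
    context = [c for s in sources                      # Phase 2: retrieval
               for c in retrieve_from(s, request)]
    return generate_answer(request, context)           # Phase 3: generation
```

Evaluating each phase separately then means testing `detect_intent`, the retrieval step, and `generate_answer` against their own metrics, rather than judging only the end-to-end answer.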

Something No One Tells You: When evaluating your RAG models, the key dimension to evaluate against is data, both in volume and diversity. But not just any data. Your RAG performance metrics are only meaningful if you evaluate them using your own data, use cases, and scenarios.

With this context, we’ll dive into each RAG phase next.

Intent Detection and Routing

Intent detection and routing involves understanding the intent or goal behind a RAG input, classifying it into one of several predefined categories, and then routing the input to an appropriate system or data store.

Intent detection can also be used to filter out off-topic, out of scope or malicious queries, or to ask for user clarification when a query is ambiguous.

A classification model is typically used for intent detection, but it's worth considering a smaller, more efficient classification model for this to minimize latency and operating costs. What you lose in precision with a smaller model can often be made up for with custom routing logic that further refines how the system directs queries to the appropriate data sources.
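For illustration, here's what a small intent classifier plus routing table might look like, using scikit-learn as one example of a lightweight model. The training queries, intent labels, and `ROUTES` table are all hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; in practice you'd use real or synthetic user queries.
queries = ["where is my order", "track my package delivery",
           "how do I reset my password", "password reset is not working"]
intents = ["order_status", "order_status", "account_help", "account_help"]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(queries, intents)

# Custom routing logic: map each predicted intent to the data sources that serve it.
ROUTES = {"order_status": ["orders_db"], "account_help": ["help_center_index"]}

def route(query: str) -> list[str]:
    intent = classifier.predict([query])[0]
    return ROUTES.get(intent, ["fallback_index"])

print(route("need a password reset"))  # likely -> ['help_center_index']
```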

Once your routing system has determined the correct data store or API for the retrieval phase, the query sometimes needs to be reformulated to match the structure or language expected by that system. This involves one or both of the following (the extraction step is sketched in code after the list):

  • Query Extraction: Extracting specific parameters from the initial input that are crucial for the retrieval phase. For example, this could be date ranges or product names, if data sources are split by product.

  • Query Decomposition and Rewriting: Breaking up or rewriting the query to produce candidates more likely to get relevant results in retrieval.
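As a hedged example of the extraction step, here's a sketch that pulls a date range and product names out of a raw query. The date format, `KNOWN_PRODUCTS` catalog, and parameter names are all assumptions for illustration:

```python
import re
from datetime import date

KNOWN_PRODUCTS = {"widget pro", "widget mini"}   # hypothetical product catalog

def extract_query_params(query: str) -> dict:
    """Pull out parameters the retrieval phase needs (dates, product names)."""
    params = {}
    # Date ranges like "2024-01-01 to 2024-03-31".
    m = re.search(r"(\d{4}-\d{2}-\d{2})\s+to\s+(\d{4}-\d{2}-\d{2})", query)
    if m:
        params["date_range"] = (date.fromisoformat(m.group(1)),
                                date.fromisoformat(m.group(2)))
    # Product names, useful when data sources are split by product.
    params["products"] = [p for p in KNOWN_PRODUCTS if p in query.lower()]
    return params

print(extract_query_params(
    "Show Widget Pro incidents from 2024-01-01 to 2024-03-31"))
```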

How Okareo Can Help

  • Classification: Okareo provides tools that evaluate the classification of user intents, ensuring that the system correctly identifies and categorizes requests. Learn more about Okareo's classification evals.

  • Using Synthetic Data to Improve Test Coverage and Performance: To enhance the accuracy and reliability of intent detection, Okareo offers synthetic data generation. This helps improve test coverage and overall system performance. Read more on how synthetic data can help.

  • Intent Detection Fine-Tuning: Okareo also supports fine-tuning models for intent detection, allowing you to further optimize understanding and routing of specific types of queries. Explore fine-tuning options.

Retrieval

The retrieval phase of RAG involves fetching a list of candidate data points from an external source and then ranking them by relevance, so that the most relevant data can be sent to your generative model along with the original input. This improves the accuracy, relevance, and depth of the LLM's response.

The most commonly used tool in the retrieval process is the vector database, but many other data sources can also be used, including graph DBs, SQL/NoSQL DBs, keyword search engines like Elasticsearch, or external APIs.

In this article, we're going to focus on vector-based retrieval, but other methods deserve equal consideration.

1st Stage - Retrieve

In this stage, a broad set of potentially relevant data points is retrieved based on the input query. This is often done using fast, basic vector search across a vector database to quickly narrow down a large dataset. The key components are listed below, followed by a code sketch of the whole stage.

  • Vector database: Your data can be stored as vector embeddings in a vector database, which has built-in algorithms for similarity search. You use an embedding model to convert text into vector embeddings before saving them in the vector DB.

  • Embedding model: The RAG input is also converted to a vector embedding via your embedding model, which can then be used to query the vector database for similar vector embeddings. The vector database uses a similarity search algorithm to find the closest matches and returns a list of them. Large embedding models can be one of the most expensive components in a RAG system, so it's worth noting that many smaller models are quite powerful and can be 'good enough' in terms of latency and retrieval performance.

  • Vector embedding: This is a mathematical representation of data in vector form, which makes it easy to measure the similarity between one piece of data and another.

  • Querying the vector DB: The user query is converted to a vector embedding and sent to the vector DB, which uses a similarity search algorithm to find and return a list of the closest matches.
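Putting the pieces above together, here's a minimal first-stage retrieval sketch using the sentence-transformers library. The model name and documents are just examples, and a production system would store the document vectors in a vector DB rather than in memory:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# A small embedding model; often "good enough" at much lower latency and cost.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Our return policy allows refunds within 30 days.",
    "Widget Pro ships with a two-year warranty.",
    "The API rate limit is 100 requests per minute.",
]

# Index step: embed documents once (a vector DB would store these for you).
doc_vectors = model.encode(documents, normalize_embeddings=True)

def search(query: str, top_k: int = 2) -> list[str]:
    """Embed the query and return the closest documents by cosine similarity."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q           # dot product = cosine sim (normalized vectors)
    best = np.argsort(-scores)[:top_k]
    return [documents[i] for i in best]

print(search("how long is the warranty?"))
```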

2nd Stage - ReRanking

After the initial set of results is gathered, a more powerful but slower ReRanker model is applied to reorder the results based on their relevance to the query. ReRanker models are more computationally expensive but usually more accurate than embedding models, hence they are saved for the second stage, where they operate on a smaller, narrowed-down set of results.
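Continuing the sketch above, a second-stage reranker might look like this; the specific cross-encoder model is just one commonly used example:

```python
from sentence_transformers import CrossEncoder

# Cross-encoders score (query, document) pairs jointly: slower than embedding
# search, but usually more accurate -- so they only run on the narrowed set.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]
```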

How Okareo Can Help

  • Explore Okareo's retrieval evaluation: Due to the intrinsic complexity of retrieval, there is a need for performance evaluation dedicated to the retrieval phase of RAG. If you can't isolate and trap problems here, they cascade into downstream failures.

  • Optimizing For Your Data: Okareo provides guidance on selecting an embedding model that fits your data. Learn more about embedding model selection.

  • Synthetic Generation of Evaluation Data: Build evaluations for the Retrieve and ReRanking stages using your data and your typical queries. These could come from production or be synthetically generated from your seed inputs. You don't want a random benchmark or dataset that won't show what relevance means for your app.

Generation

In the Generation phase, the system takes the relevant data retrieved in the previous steps and uses it to generate the final output. This step is crucial as it synthesizes the input data into a coherent and meaningful response, whether it's answering a query, generating text, or making decisions.

  • Reasoning and Decision-Making Tasks: In this phase, the system undertakes complex reasoning and decision-making tasks. These tasks can complicate performance evaluations, as they require the model to not only generate text but also make logical decisions based on the context provided.

  • Cycle Between Generation Model and Reflective Model: In many implementations, before the result is returned from the Generation Model, it is reviewed for errors by a Reflective Model, or 'quality model'. The Reflective Model could be a more powerful version of (or exactly the same model as) the Generation Model, taking advantage of LLMs' reflective properties to find errors and inconsistencies when that is the focus of the prompt. This cycle is sketched below.
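Here's a minimal sketch of that generation-reflection cycle; `call_llm` is a hypothetical stand-in for your real Generation/Reflective model client, and the prompts are illustrative only:

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model client (Generation or Reflective).
    return "OK"

def generate_with_reflection(query: str, context: list[str], max_rounds: int = 2) -> str:
    """Generate a draft answer, then have a reflective pass review it before returning."""
    context_block = "\n".join(context)
    draft = call_llm(f"Context:\n{context_block}\n\nQuestion: {query}")
    for _ in range(max_rounds):
        review = call_llm(
            "Review the answer below for errors or claims unsupported by the "
            "context. Reply OK if it is sound, otherwise rewrite it.\n\n"
            f"Context:\n{context_block}\nQuestion: {query}\nAnswer: {draft}"
        )
        if review.strip() == "OK":
            break                 # reflective model found no problems
        draft = review            # adopt the corrected answer and re-check
    return draft
```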


What's next?

Interested in building and evaluating a RAG? Give Okareo a try and follow our documentation to get started.

