Understanding RAG Architecture and Its Key Components

Boris Selitser, Co-founder of Okareo
Sarah Barber, Senior Technical Content Writer
August 21, 2024
RAG is fast becoming an industry standard, especially when users need real-time access to large amounts of dynamic data. Today, many of the largest AI chatbots (including ChatGPT and Bard) use some variation of RAG architecture, and developers of LLM-powered apps are also choosing to use it.
But RAG isn't just about "slapping a Vector DB on an LLM." Integrating a RAG system into a real-world app involves many technical decisions. Here, we define RAG architecture, explaining how the key components work and fit together.
What is RAG and when do you use it?
Fundamentally, RAG is an architecture that LLM-powered app developers use to improve the performance of foundational LLMs. RAG augments what a generative LLM can do by including a retrieval mechanism that fetches relevant data from external, specialized sources and feeds it into the LLM (along with the original query) for added context. This enhances the accuracy and contextual relevance of the responses from generative models. It also reduces hallucinations, and the external data can include real-time information, unlike LLM training data, which usually has a cutoff date.
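To make that augmentation step concrete, here's a minimal sketch of the pattern in Python. Both `retrieve_documents` and `call_llm` are hypothetical stand-ins for your own retrieval layer and LLM client:

```python
# Minimal sketch of the RAG pattern. Both helper functions are hypothetical
# stand-ins: a real system would use a vector DB and an LLM provider.

def retrieve_documents(query: str, top_k: int = 3) -> list[str]:
    # Toy keyword match; a real retriever would use semantic search.
    corpus = [
        "Our API rate limit is 100 requests per minute.",
        "Support is available Monday through Friday.",
        "Rate limit increases can be requested via the dashboard.",
    ]
    words = set(query.lower().split())
    return [d for d in corpus if words & set(d.lower().split())][:top_k]

def call_llm(prompt: str) -> str:
    # Placeholder for a call to your LLM of choice.
    return f"(LLM response to a {len(prompt)}-character prompt)"

def answer_with_rag(query: str) -> str:
    # The core RAG move: retrieved context goes into the prompt
    # alongside the original query.
    context = "\n".join(retrieve_documents(query))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)

print(answer_with_rag("What is the API rate limit?"))
```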
A key principle of RAG is that it separates the LLM's emerging reasoning abilities (such as decision-making, planning, and problem detection) from the factual data it was trained on.
Since RAG is a general architecture, there are many ways to apply it, which explains its popularity. Some common use cases are:
Question answering: This is the most obvious and best publicized use case for RAG. This could be a feature in a larger FAQ, user help base, or knowledge app. RAG allows multiple sources of information to be used to synthesize answers, including domain-specific data (for example, proprietary information about clinical trials or engineering documentation about a specific product) that might not have been part of the LLM's more general original training data.
Chatbot/Copilot: Previous generations of chatbots were often rule-based and scripted. Adding RAG allows conversational interaction with your domain-specific knowledge, including nuances of dialog context and history.
Enterprise search: Traditionally, this was done via keyword-indexing approaches (such as Elasticsearch). Now RAG brings more relevant results with semantic retrieval and the ability to answer more complex queries by correlating several data sources.
Agent knowledge: In agentic systems, individual agents perform tasks based on external environment knowledge. An agent responsible for booking airline tickets can now use RAG to pull data from external knowledge bases, such as the current reservations database or the latest policy documents store, and use this information to decide how to handle a particular request.
For a detailed real-world example of a RAG system in use, read about how LinkedIn built a RAG system in which multiple agents retrieve and synthesize information from various sources, including real-time data from the LinkedIn and Bing APIs.
What is the architecture of a RAG system?
Here's a high-level architecture I created for a typical RAG system. Of course, there's no such thing as a "typical" RAG — it can be much more complex and tailored to specific use cases. However, the core phases and concepts shown below are what we consistently see across many production deployments.

If you're using RAG architecture, it's likely that some of your data will be in a vector DB and you're considering how to connect it to an LLM. But you may also have other data relating to your app that's not in a vector database, and it may never need to be. This raises the question: how do you leverage RAG with the rest of your app, microservices, or data? Considering the range of use cases above, how do you determine which data sources are relevant to a given request?
Questions like this are why intent detection and routing has now become a key component of many RAG systems. The diagram above shows the three core phases of RAG:
Intent detection + routing: Identifying which data sources or microservices are needed to fulfill an incoming request — this is your RAG entry point.
Retrieval: Finding relevant data from one or more of the identified data sources.
Generation: Synthesizing the initial intent or goal (autonomous systems often have requests without user input) into a result using retrieved data.
Each of these three phases relies on a model that you need to choose or configure specifically for that purpose, and problems in any of these phases will translate into downstream failures. That's why it's important to evaluate each phase of your RAG separately. You can also optimize your RAG by optimizing each phase in turn.
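One practical way to keep the phases separately evaluable is to give each one its own function boundary. Here's a minimal sketch; the phase implementations are hypothetical placeholders, and the seams between them are the point:

```python
# Sketch of a three-phase RAG pipeline with seams for per-phase evaluation.
# The phase bodies are hypothetical stand-ins for real implementations.

def detect_intent_and_route(request: str) -> str:
    # Phase 1: classify the request and pick a data source.
    return "product_docs" if "product" in request.lower() else "faq"

def retrieve(request: str, source: str) -> list[str]:
    # Phase 2: fetch candidate documents from the chosen source.
    return [f"[{source}] document relevant to: {request}"]

def generate(request: str, docs: list[str]) -> str:
    # Phase 3: synthesize the final answer from retrieved context.
    return f"Answer to '{request}' based on {len(docs)} document(s)."

def run_pipeline(request: str) -> str:
    source = detect_intent_and_route(request)  # evaluate routing accuracy here
    docs = retrieve(request, source)           # evaluate retrieval relevance here
    return generate(request, docs)             # evaluate generation quality here

print(run_pipeline("How do I reset my product password?"))
```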
Something no one tells you: When evaluating your RAG models, the key dimension to evaluate against is data, both in volume and diversity. But RAG performance metrics are only meaningful if you evaluate them using your own data, use cases, and scenarios.
With this context, we’ll dive into each RAG phase next.
Intent Detection and Routing
Intent detection and routing involves understanding the intent or goal behind a RAG input, classifying it into one of several predefined categories, and then routing the input into an appropriate system or data store.
[Image: Various queries getting routed by a RAG's intent detection phase.]
Intent detection can also be used to filter out off-topic, out-of-scope, or malicious queries, or to ask for user clarification when a query is ambiguous.

A classification model is typically used for intent detection, and it's worth considering a smaller, more efficient classification model to minimize latency and operating costs. What you may lack in precision with a smaller model can often be made up for by using custom routing logic to further refine how the system directs queries to the appropriate data sources.
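To illustrate how lightweight such a router can be, here's a sketch of an intent classifier built with scikit-learn. The intent labels and training examples are hypothetical, and a production router would be trained on far more data:

```python
# Sketch: a lightweight intent classifier for routing, using scikit-learn.
# The training examples and intent labels are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

examples = [
    ("When does my flight board?", "reservations"),
    ("Change my seat on tomorrow's flight", "reservations"),
    ("What is your checked baggage policy?", "policy_docs"),
    ("Can I bring a pet on board?", "policy_docs"),
    ("Tell me a joke", "out_of_scope"),
]
texts, labels = zip(*examples)

router = make_pipeline(TfidfVectorizer(), LogisticRegression())
router.fit(texts, labels)

# Route an incoming query to a data source based on predicted intent.
print(router.predict(["What's the policy on oversized bags?"])[0])  # e.g. "policy_docs"
```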
Once your routing system has determined the correct data store or API for the retrieval phase, the query sometimes needs to be reformulated to match the structure or language expected by that system. This typically involves two steps, sketched in the example after this list:
Query extraction: Extracting specific parameters from the initial input that are crucial for the retrieval phase. For example, this could be date ranges or product names if data sources are split by product.
Query decomposition and rewriting: Breaking up or rewriting the query to produce candidates more likely to get relevant results in retrieval.
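Here's a sketch of both steps. The regex-based extraction and the naive splitting heuristic are illustrative assumptions; many production systems use an LLM for the rewriting step:

```python
# Sketch: query extraction and decomposition before retrieval.
# The date pattern and splitting heuristic are illustrative assumptions.
import re

def extract_parameters(query: str) -> dict:
    # Pull structured parameters out of the raw query (here, ISO dates).
    dates = re.findall(r"\d{4}-\d{2}-\d{2}", query)
    return {"dates": dates} if dates else {}

def decompose_query(query: str) -> list[str]:
    # Naive decomposition: split a compound question into sub-queries.
    parts = [p.strip() for p in re.split(r"\band\b", query) if p.strip()]
    return parts if len(parts) > 1 else [query]

query = "What caused the outage on 2024-06-15 and how was it resolved?"
print(extract_parameters(query))  # {'dates': ['2024-06-15']}
print(decompose_query(query))     # one sub-query per clause
```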

How Okareo Can Help
Classification: Okareo provides tools that evaluate the classification of user intents, ensuring that the system correctly identifies and categorizes requests. Learn more about Okareo's classification evaluations.
Using synthetic data to improve test coverage and performance: To enhance the accuracy and reliability of intent detection, Okareo offers synthetic data generation. This helps improve test coverage and overall system performance. Read more on how synthetic data can help.
Intent detection fine-tuning: Okareo also supports fine-tuning models for intent detection, allowing you to further optimize understanding and routing of specific types of queries. Explore fine-tuning options.
Retrieval
The retrieval phase of RAG fetches a list of candidate data points from an external source and then ranks them by relevance, so that only the most relevant data is later sent to your generative model along with the original input. This improves the accuracy, relevance, and depth of the response from the LLM.
The most commonly used tool in the retrieval process is the vector database, but many other data sources can also be used for this, including Graph DBs, SQL/NoSQL DBs, keyword search engines like Elasticsearch, or external APIs.
In this article, we’re going to focus on vector-based retrieval, but other methods deserve equal consideration.
Stage 1: Retrieve
In this stage, the system retrieves a broad set of potentially relevant data points based on the input query. It often does this with a fast, basic vector search across a vector database to quickly narrow down a large dataset.
Vector database: Your data can be stored as vector embeddings in a vector database, which has built-in algorithms for similarity search. You use an embedding model to convert text into vector embeddings before saving them in the vector DB.
Embedding model: Your embedding model also converts the RAG input to a vector embedding, which is then used to query the vector database for similar embeddings. Large embedding models can be one of the most expensive components in a RAG system, so it's worth noting that many smaller models are quite powerful and can be "good enough" in terms of latency and retrieval performance.
Vector embedding: A numerical representation (a vector) of a piece of data, designed so that the similarity of two pieces of data can be measured by comparing their vectors.
Querying the vector DB: The system converts the user query to a vector embedding and sends it to the vector DB, which uses a similarity search algorithm to find and return a list of the closest matches, as sketched below.
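To show the mechanics that a vector DB implements for you, here's a minimal sketch of similarity search over toy embeddings. The `embed` function is a hypothetical stand-in: real embedding models produce dense semantic vectors, not character counts:

```python
# Sketch of what a vector DB's similarity search does under the hood.
# `embed` is a hypothetical stand-in for a real embedding model.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy embedding: normalized character-frequency vector. Real models
    # produce dense semantic vectors with hundreds of dimensions.
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1
    return vec / (np.linalg.norm(vec) or 1.0)

documents = [
    "Reset your password from the account settings page.",
    "Refunds are processed within five business days.",
    "Two-factor authentication protects your account.",
]
doc_vectors = np.stack([embed(d) for d in documents])

query_vec = embed("How do I reset my account password?")
scores = doc_vectors @ query_vec    # cosine similarity (vectors are unit length)
top = np.argsort(scores)[::-1][:2]  # indices of the 2 closest matches
print([documents[i] for i in top])
```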
Stage 2: Reranking
After the system gathers the initial set of results, it applies a more powerful but slower reranker model to reorder the results based on their relevance to the query. Reranker models are more computationally expensive, but they’re usually more accurate than embedding models. Hence, they are saved for the second stage to operate on a smaller set of narrowed results.
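Here's a sketch of the reranking stage using a cross-encoder, which scores each (query, document) pair jointly instead of comparing precomputed embeddings. It assumes the sentence-transformers library and one of its published checkpoints:

```python
# Sketch: reranking retrieval candidates with a cross-encoder.
# Assumes the sentence-transformers library and a published checkpoint.
from sentence_transformers import CrossEncoder

query = "How do I reset my account password?"
candidates = [
    "Reset your password from the account settings page.",
    "Refunds are processed within five business days.",
    "Two-factor authentication protects your account.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])

# Reorder the narrowed candidate set by relevance score, highest first.
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])
```

Note that the reranker only ever sees the handful of candidates the first stage returned, which is what keeps its higher per-item cost affordable.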
How Okareo Can Help
Explore Okareo's retrieval evaluation: Due to the intrinsic complexity of retrieval, the Retrieval phase of RAG requires a dedicated performance evaluation. If you can’t isolate and trap problems here, they cascade into downstream failures.
Optimizing for your data: Okareo provides guidance on selecting an embedding model that fits your data. Learn more about embedding model selection.
Synthetic generation of evaluation data: Okareo can build evaluations for the Retrieve and Reranking stages using your data and your typical queries. These could come from production or be synthetically generated from your seed inputs. You don't want a generic benchmark or dataset that won't reflect what relevance means for your app.
Generation
In the Generation phase, the system takes the relevant data retrieved in the previous steps and uses it to generate the final output. This step is crucial, as it synthesizes the input data into a coherent and meaningful response, whether it's answering a query, generating text, or making decisions.
Reasoning and decision-making tasks: In this phase, the system undertakes complex reasoning and decision-making tasks. These tasks can complicate performance evaluations, as they require the model to not only generate text, but also make logical decisions based on the context provided.
Cycle between generation model and reflective model: In many implementations, before the generation model returns the result, a reflective model, or “quality model,” reviews the result for errors. A reflective model could be a more powerful version of (or exactly the same model as) the generation model, taking advantage of an LLM's reflective property to find errors and inconsistencies when that is the focus of the prompt.
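Here's a sketch of that generate-and-review cycle. The `call_llm` placeholder and the prompt wording are assumptions, not a prescribed format:

```python
# Sketch: a generation/reflection cycle. `call_llm` is a hypothetical
# stand-in for your LLM client; the prompts are illustrative assumptions.

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call your model provider.
    return "APPROVED" if prompt.startswith("Review") else "Draft answer."

def generate_with_reflection(query: str, context: str, max_rounds: int = 2) -> str:
    draft = call_llm(f"Using this context:\n{context}\n\nAnswer: {query}")
    for _ in range(max_rounds):
        review = call_llm(
            "Review the answer below for errors or unsupported claims.\n"
            "Reply APPROVED if it is sound, otherwise list the problems.\n\n"
            f"Context:\n{context}\n\nAnswer:\n{draft}"
        )
        if review.strip().startswith("APPROVED"):
            break
        # Feed the reviewer's critique back into a revision pass.
        draft = call_llm(f"Revise the answer to fix these problems:\n{review}\n\nAnswer:\n{draft}")
    return draft

print(generate_with_reflection("What is RAG?", "RAG augments LLMs with retrieved context."))
```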
How Okareo Can Help
Get started with generation evaluations: Okareo can evaluate how well your models are generating relevant and accurate outputs based on the context provided.
Scoring a generative model's output: With Okareo’s help, you can ensure the output of your generative models meets the quality standards required for your application.
Add LLM evaluation to your CI workflow: To maintain high-quality output as you develop and deploy your models, Okareo can integrate LLM evaluation into your continuous integration (CI) workflow.
What's next?
Interested in building and evaluating a RAG? Give Okareo a try and follow our documentation to get started.