RAG Optimization: Techniques to Make your RAG Faster, Cheaper, and More Accurate

Optimizing a Retrieval-Augmented Generation (RAG) system generally involves three goals:

Improving the quality, accuracy, and appropriateness of the responses that the RAG system generates.
Generating responses faster — this may mean lower latency or higher throughput.
Doing it cheaper — lowering the cost per response or cost of the system overall.

In this article, we offer a few RAG optimization techniques for all parts of your RAG system.

A brief overview of RAG architecture

A RAG system often consists of three separate components—intent detection and routing, retrieval, and generation—and you need to optimize each of these separately. However, poor performance in an upstream RAG phase has a cascading effect on the overall performance, so earlier phases should not be skipped or glossed over.

Important note: Your RAG performance metrics are only meaningful if you evaluate them using your own data.

As we mention in our article on RAG architecture, there is no single RAG architecture that’s representative of every single RAG system because they can be so different — but there are a few key components that tend to be present in all RAG systems. You can see how they work together in the following diagram.

Components of a RAG system (that may need to be individually optimized)

Let’s see how we can optimize each component individually.

General techniques for optimizing your RAG

Across all phases, you will likely get a performance improvement by using the right model for each task and fine-tuning the model (via domain specific or task specific fine-tuning).

In all cases, high-quality, clean datasets and scenarios will also help you get better performance, as cleaner data results in lower need for data processing and reduces the likelihood of errors due to data quality.

RAG optimization at the intent detection/routing phase

At the intent detection and routing phase, you can make the RAG system more optimal by making sure that irrelevant and expensive queries get filtered out and that the routing is as accurate as possible. Here are the specific RAG optimizations you can do in this phase:

Filter out off-topic, out of scope, or malicious queries before routing them to any downstream system. It's crucial to filter out these irrelevant or harmful queries so that they don’t affect the performance of the system and don’t use resources unnecessarily. Consider that some filtering queries may be cheaper and faster than others.

Consider using a small classification model for intent detection. By using a classification model for this task, rather than a more general, larger model, you can minimize latency and operating costs while still ensuring accurate routing.

Diagram showing how a smaller, less complex model can do intent detection as well as an LLM.

Fine-tune the classification model to your specific domains. Once you’re using a suitable model for the intent detection task, you can further improve performance by fine-tuning it on the specific kinds of queries that your users are likely to supply.

Use custom routing logic. If you know that specific queries will get processed faster or more effectively using data from a particular source, you can implement that in your routing logic rather than letting the classification model make that decision every time.

Evaluate your intent detection model to ensure that it is correctly identifying intent. Use specific metrics suitable for evaluating intent detection classification models, such as accuracy, F1 score, precision, and recall.

How Okareo can help with RAG optimizations at the intent detection and routing phase

Evaluates classification models: Okareo provides tools that evaluate the classification of user intents, ensuring that the system correctly identifies and categorizes requests. Learn more about Okareo's classification evaluations.

Generates synthetic data for improving test coverage and performance: To enhance the accuracy and reliability of intent detection, Okareo offers synthetic data generation. This helps improve test coverage and overall system performance. Read more on how synthetic data can help.

Supports intent detection fine-tuning: Okareo also supports fine-tuning models for intent detection, allowing you to further optimize understanding and routing of specific types of queries. Explore fine-tuning options.

RAG optimization at the retrieval phase

Once the intent is clear, the next step is deciding which data source to use for your retrieval phase. While much attention is on vector databases, your data might be in different relational databases, data lakes, graph databases, or a combination of these, depending on the specific use case. In some cases, you want to route the request to an API or perform a web search. This decision is key, because the quality of the response heavily depends on the relevance and accuracy of the chosen data source.

Here are a few RAG optimization techniques for the retrieval phase:

Baseline on BM25: BM25 (Best Matching 25) is a basic keyword-based ranking algorithm that gives a relevance score based on things like term frequency, document frequency, and document length. This is a very efficient way to narrow down your data and provide an initial benchmark against which your retrieval system can be evaluated. After this initial retrieval stage, you can narrow things down even faster with a reranking stage that typically uses a similarity search algorithm to rerank documents according to which is most similar to the query.

Choose an appropriate embedding model: Your embedding model is the key component to evaluate in terms of latency, cost, and retrieval performance. It's worth noting that you don't always need a large, expensive model — many smaller models are quite powerful and can be good enough. Your embedding model is a key part of the reranking stage, so during evaluation, you should compare any performance gains to your BM25 baselines.

Use a hybrid search approach: When deciding between modern sparse vector models (for example, SPLADE, which mixes traditional term-matching techniques like that of BM25 with deep learning) and dense vector models like Sentence-BERT, consider that a hybrid setup often works best. This means retrieving documents from both models and combining the top results of each, giving extra weighting to one of the models depending on whether exact matching or semantic similarity are more important. In many cases, a hybrid setup delivers better results for general use cases.

Build evaluations using your own data and your typical queries: These could come from production or be synthetically generated based on your seed inputs. You don’t want some random benchmark or dataset that will not show what relevance means for your app.

Balance the footprint of retrieve and reranking models with their performance: You'll need to strike a balance between speed, cost, and accuracy.

Fine-tune retrieve and reranking models on domain-specific data (both synthetic and real-word data).

How Okareo can help with RAG optimization at the retrieval phase

Explore Okareo's retrieval evaluation: Due to intrinsic complexity of retrieval, there is a need for performance evaluation dedicated to the retrieval phase of RAG. If you can’t isolate and trap problems here, they cascade into downstream failures.

Optimizing for your data: Okareo provides guidance on selecting an embedding model that fits your data. Learn more about embedding model selection.

Synthetic generation of evaluation data: Synthetically generate evaluation scenarios (testing data that consists of a series of model inputs paired with its corresponding expected result) based on your own data and typical queries.

RAG optimization at the generation phase

In this phase, your RAG undertakes complex reasoning and decision-making tasks. These tasks can complicate performance evaluations, as they require the model to not only generate text but also make logical decisions based on the context provided. Some RAG optimization techniques for this phase include:

Use LLM-as-a-judge: Using another LLM to evaluate the output of your LLM can enhance the quality of your output without requiring a human in the loop. LLM-as-a-judge uses a reflective model, or “quality model,” to review the result of your generation model for errors. You can either have one LLM evaluate another, or an LLM can evaluate itself.

Use reflection tuning: Reflection tuning is a more advanced version of using LLM-as-a-judge that cycles iteratively between the generation model and reflective model, refining the output until the reflective model is satisfied that it is high quality. A reflective model could be a more powerful or the same model as the generation model, and it takes advantage of LLMs reflective property to find errors and inconsistencies when that is the focus of the prompt.

How Okareo can help with RAG optimization at the generation phase

Get started with generation evaluations: Evaluate how well your models are generating relevant and accurate outputs based on the context provided.

Scoring a generative model's output: Ensure the output of your generative models meets the quality standards required for your application.

Add LLM evaluation to your CI workflow: Maintain high-quality output as you develop and deploy your models by integrating your LLM evaluation into your continuous integration (CI) workflow.

What's next?

Would you like to get started with optimizing your RAG? Sign up for Okareo and follow our documentation to get going.

Optimizing a Retrieval-Augmented Generation (RAG) system generally involves three goals:

Improving the quality, accuracy, and appropriateness of the responses that the RAG system generates.
Generating responses faster — this may mean lower latency or higher throughput.
Doing it cheaper — lowering the cost per response or cost of the system overall.