Optimizing RAG

Boris Selitser, Co-founder of Okareo

March 9, 2024

Optimizing a RAG system generally involves three things:

  1. Improving the quality, accuracy and appropriateness of the responses that the RAG system generates.

  2. Generating responses faster — this may mean lower latency or higher throughput.

  3. Doing it cheaper — so lower cost per response or lower cost of the system overall.

In this article, we offer techniques for optimizing each part of the RAG system.

Brief overview of RAG architecture

A RAG system typically consists of three separate components: intent detection and routing, retrieval, and generation. Each needs to be optimized separately.

Important note: Your RAG performance metrics are only meaningful if you evaluate them using your own data. Poor performance in an upstream RAG phase has a cascading effect on the overall performance.

As we mention in our article on RAG architecture, there is no single RAG architecture that’s representative of every single RAG system because they can be so different — but there are a few key components that tend to be present in all RAG systems. You can see how they work together in the following diagram.

Architecture diagram showing the key components that tend to be present in all RAG systems

Let’s see how we can optimize each component individually.

General techniques for getting better results from a RAG system

Across all phases, you will likely get a performance improvement by using the right model for each task and fine-tuning it (via domain-specific or task-specific fine-tuning). More on this below.

In all cases, high-quality, clean datasets and scenarios will also help you get better performance: cleaner data needs less processing and is less likely to introduce errors caused by data quality.

Optimizing RAG intent detection/routing phase

At the intent detection and routing phase, you can make the RAG system more efficient by filtering out irrelevant and expensive queries and by making the routing as accurate as possible. Here are the specific things you can do:

Filter out off-topic, out-of-scope, or malicious queries before routing them to any downstream system. It's crucial to filter out these irrelevant or harmful queries so that they don't affect the performance of the system and don't use resources unnecessarily. Keep in mind that some filtering checks are cheaper and faster than others, so run the inexpensive ones first.

Consider using a small classification model for intent detection. By using a classification model for this task, rather than a more general, larger model, you can minimize latency and operating costs while still ensuring accurate routing.

Fine-tune the classification model to your specific domains. Once you're using a suitable model for the intent detection task, you can further improve performance by fine-tuning it on the specific kinds of queries that your users are likely to supply.

Use custom routing logic. If you know that specific queries will get processed faster or more effectively using data from a particular source, you can implement that in your routing logic rather than letting the classification model make that decision every time.
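
To make this concrete, here is a minimal sketch of rule-based routing layered in front of a classifier. The route names, keywords, and the `classify_intent` stub are hypothetical placeholders, not a real API:

```python
# Hypothetical routes: keyword rules short-circuit the intent classifier
# for queries we already know how to handle.
RULES = {
    "order status": "orders_api",   # known fast path: query the orders API
    "refund": "billing_db",         # billing questions go to the billing DB
}

def classify_intent(query: str) -> str:
    """Stand-in for a small fine-tuned classification model."""
    return "general_docs"

def route(query: str) -> str:
    q = query.lower()
    # Rule-based fast path: skip the model call when a keyword decides the route.
    for keyword, destination in RULES.items():
        if keyword in q:
            return destination
    # Fall back to the classifier for everything else.
    return classify_intent(query)
```

The rules handle the cheap, unambiguous cases without a model call; only ambiguous queries pay the classifier's latency.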

Evaluate your intent detection model to ensure that it is correctly identifying intent. Use specific metrics suitable for evaluating intent detection models, such as accuracy, F1 score, precision, and recall.
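
The metrics above can be computed directly from predicted and true labels. A minimal sketch (one-vs-rest precision/recall/F1 for a single intent class; the label values are illustrative):

```python
def intent_metrics(y_true, y_pred, positive):
    """Accuracy plus precision/recall/F1 for one intent class,
    computed one-vs-rest from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For multi-class intent models, you would average these per-class scores (macro or weighted) across all intents.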

How Okareo can help with optimizations at the intent detection and routing phase

Classification: Okareo provides tools that evaluate the classification of user intents, ensuring that the system correctly identifies and categorizes requests. Learn more about Okareo's classification evals.

Using Synthetic Data to Improve Test Coverage and Performance: To enhance the accuracy and reliability of intent detection, Okareo offers synthetic data generation. This helps improve test coverage and overall system performance. Read more on how synthetic data can help.

Intent Detection Fine-Tuning: Okareo also supports fine-tuning models for intent detection, allowing you to further optimize understanding and routing of specific types of queries. Explore fine-tuning options.

Optimizing RAG at the retrieval phase

Once the intent is clear, the next step is deciding which data source to use. While much attention is on vector databases, your data might be in different relational databases, data lakes, graph databases, or a combination of these, depending on the specific use case. In some cases, you want to route the request to an API or perform a web search. This decision is key because the quality of the response heavily depends on the relevance and accuracy of the chosen data source.

Here are a few specific things that you can do to optimize retrieval.

Baseline on BM25: BM25 is a keyword-based retrieval algorithm that ranks documents by how well their terms match a user's query. It doesn't use vectors at all; it simply scores exact matches between terms in the user's query and the text of each document, which makes it cheap to run. You can use it to quickly narrow down the candidate documents in your search space before running a vector similarity search. Despite the hype around vector models, it's often best to start with a baseline such as BM25, as it's very efficient at this early narrowing-down step.
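
As a toy illustration of this baseline, here is a minimal BM25 scorer over pre-tokenized documents. This is a simplified sketch (standard Okapi formula; the parameter defaults and whitespace tokenization are assumptions), not a tuned implementation:

```python
import math

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc (list of terms) against the query terms
    using the Okapi BM25 formula."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n  # average document length
    # Document frequency for each query term.
    df = {t: sum(t in d for d in docs) for t in query_terms}
    scores = []
    for doc in docs:
        score = 0.0
        for t in query_terms:
            tf = doc.count(t)  # term frequency in this doc
            if tf == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf * (k1 + 1) / (
                tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
```

Note how documents without any query term score exactly zero, and shorter documents with the same match count score higher; both properties make BM25 a fast first-pass filter.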

Embedding Model Selection: The embedding model is the key choice to evaluate in terms of latency, cost, and retrieval performance. Many smaller models are quite powerful and can be 'good enough'. When evaluating an embedding model, compare its performance gains against your BM25 baseline.

Hybrid Search: A hybrid setup often works best, combining a modern sparse model (e.g. SPLADE) with a dense vector model. This means retrieving documents from both models and merging the top results of each, giving extra weight to one model depending on whether exact matching or semantic similarity is more important for your use case.
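
One common way to merge the two result sets is a weighted sum of normalized scores. A minimal sketch (min-max normalization and the `alpha` weighting scheme are one of several reasonable choices, e.g. reciprocal rank fusion is another):

```python
def hybrid_merge(sparse_scores, dense_scores, alpha=0.5, top_k=3):
    """Fuse per-document scores from a sparse retriever (e.g. BM25/SPLADE)
    and a dense retriever. Inputs are {doc_id: score} dicts; alpha weights
    the sparse side. Returns the top_k doc ids by fused score."""
    def normalize(scores):
        # Min-max normalize so the two score scales are comparable.
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    sparse_n, dense_n = normalize(sparse_scores), normalize(dense_scores)
    docs = set(sparse_n) | set(dense_n)
    fused = {d: alpha * sparse_n.get(d, 0.0) + (1 - alpha) * dense_n.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=fused.get, reverse=True)[:top_k]
```

Raising `alpha` favors exact keyword matches; lowering it favors semantic similarity, so the weight itself is something to tune against your evaluation set.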

Build evaluations for the Retrieve and Reranking stages using your data and your typical queries. These could come from production or be synthetically generated from your seed inputs. You don't want a generic benchmark or dataset that won't show what relevance means for your app.
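
Two metrics commonly used in such retrieval evaluations are recall@k and mean reciprocal rank (MRR). A minimal sketch, assuming ranked doc-id lists per query and sets of relevant ids as ground truth:

```python
def recall_at_k(results, relevant, k):
    """Fraction of queries where at least one relevant doc appears
    in the top k of the ranked results."""
    hits = sum(any(doc in rel for doc in ranked[:k])
               for ranked, rel in zip(results, relevant))
    return hits / len(results)

def mrr(results, relevant):
    """Mean reciprocal rank of the first relevant doc per query
    (contributes 0 when no relevant doc is retrieved)."""
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, doc in enumerate(ranked, start=1):
            if doc in rel:
                total += 1 / rank
                break
    return total / len(results)
```

Tracking these on your own queries (rather than a public benchmark) is what reveals whether a change to the embedding model or reranker actually helps your app.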

Balance the footprint of Retrieve and Reranking models with their performance. You'll need to strike a balance between speed, cost, and accuracy.

Fine-tune Retrieve and Reranking models on domain-specific data (both synthetic and real-world data).

How Okareo can help with optimizing retrieval

Explore Okareo's retrieval evaluation: Evaluating retrieval performance requires specialized tools for understanding and debugging vectors and embeddings. Small issues in retrieval have a cascading effect on downstream actions such as function calling and generation. Okareo provides tooling specific to retrieval optimization.

Optimizing For Your Data: Okareo provides guidance on selecting an embedding model that fits your data. Learn more about embedding model selection.

Synthetic Generation of Evaluation Data: Okareo can synthetically generate evaluation scenarios for the Retrieve and Reranking stages from your seed inputs or production queries, so your evaluations reflect what relevance means for your app rather than a generic benchmark.

Optimizing RAG at the generation phase

In this phase, the system undertakes complex reasoning and decision-making tasks. These tasks can complicate performance evaluations, as they require the model to not only generate text but also make logical decisions based on the context provided.

Cycle Between Generation Model and Reflective Model: In many implementations, before the result is returned from the generation model, it is reviewed for errors by a reflective (or 'quality') model. The reflective model could be a more powerful model or even the same model as the generation model. The idea is to take advantage of the LLM's ability to critique output: errors and inconsistencies are easier to catch when finding them is the focus of the prompt.
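
The loop itself can be sketched in a few lines. Here `generate` and `reflect` are injected callables standing in for LLM calls, and the `(ok, feedback)` return shape of the reviewer is a hypothetical interface, not a specific library's API:

```python
def generate_with_reflection(prompt, generate, reflect, max_rounds=2):
    """Generate a draft, have a reviewer model check it, and regenerate
    with the reviewer's feedback until it passes or rounds run out.

    generate(prompt) -> str
    reflect(prompt, draft) -> (ok: bool, feedback: str)
    """
    draft = generate(prompt)
    for _ in range(max_rounds):
        ok, feedback = reflect(prompt, draft)
        if ok:
            return draft
        # Fold the reviewer's feedback into the prompt and try again.
        draft = generate(f"{prompt}\n\nReviewer feedback: {feedback}")
    return draft  # best effort after max_rounds
```

Capping the rounds matters: each reflection pass adds a model call, so this loop trades latency and cost for quality.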

How Okareo can help with optimizing generation

Get Started with Generation Evaluations: Evaluate how well your models are generating relevant and accurate outputs based on the context provided.

Scoring a Generative Model's Output: Ensure the output of your generative models meets the quality standards required for your application.

Add LLM Evaluation to Your CI Workflow: To maintain high-quality output as you develop and deploy your models, integrate LLM evaluation into your continuous integration (CI) workflow.

What's next?

Would you like to get started with optimizing your RAG? Sign up for Okareo and follow our documentation to get going.



Join the trusted future of AI. Get started delivering models your customers can rely on.
