Synthetic Data Loop
Evaluation
Boris Selitser, Co-founder of Okareo
May 5, 2024
TL;DR
Synthetic data is a key ingredient in building apps that use LLMs.
Better data coverage drives robustness and predictability of an LLM, ensuring it effectively handles your unique task and use case.
Synthetic data touches all phases of development: Describing expected model behavior, Prototyping and Experimenting, Testing and Evaluation, Model Fine-Tuning.
Synthetic data plays different roles in pre-production and in production.
Why should you care?
If you are building an AI app that calls out to an LLM, this will make your life much easier. Once you start mixing LLMs into your app stack, a few basic questions arise:
How do you select the best LLM to perform your unique task?
Will it work for all the scenarios that you expect in your app?
How stable and reliable will the experience be for your users and use cases?
There are a ton of benchmarks and leaderboards out there (probably too many) comparing model performance on predefined, abstract metrics, but how do you know if a model will work for you? Swap in the metric your app actually cares about and your own data, and all of a sudden you are building your own benchmark. Hold on, you need reliability and robustness in your LLM calls, but you don’t have that kind of time. This is where synthetic data comes in.
Establishing Baselines
The key idea to begin with is that you need to choose a set of metrics that capture your unique requirements. Once those metrics are clear, you want to measure baseline performance on them. This contrasts with traditional software engineering, where the main ‘metric’ was passing expected tests. Establishing baselines is important because LLM output is varied, non-deterministic, and there is never 100% certainty that it is correct. Having baselines and constantly evaluating your metric progress is the only way your app will ever see the light of production. Just like tests in traditional software development, baselines enable quick iterations and evolution of all the moving parts - context fed into the model, prompts, model version, and so on.
Now, all of these metric baselines are only as good as the range of data they were calculated on. Using a few ad-hoc input samples doesn't represent reality or build confidence. Getting representative data samples, in particular for new AI-powered experiences, is nearly impossible. Synthetic data addresses this problem and is an important tool in the AI engineering toolkit. Synthetic scenarios should be organized by feature and expected model behaviors. Each scenario is paired with baseline metrics. The number of these behavior scenarios and data points within each scenario maps to coverage. As AI feature development matures, expectations and error conditions become clearer and this coverage naturally grows along with confidence in application readiness.
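As a minimal sketch of what this organization could look like in code (Python, with hypothetical behavior names, inputs, and baseline numbers; no particular framework assumed), each behavior gets its own scenario set and baseline, and coverage is simply how many data points exercise that behavior:

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    behavior: str     # expected model behavior this input exercises
    user_input: str   # synthetic or sampled user input
    expected: str     # expected routing decision / label

@dataclass
class BehaviorSuite:
    behavior: str
    scenarios: list = field(default_factory=list)
    baseline: dict = field(default_factory=dict)  # e.g. {"accuracy": 0.82}

    def coverage(self) -> int:
        # Coverage maps to the number of data points exercising this behavior.
        return len(self.scenarios)

# Hypothetical example: two behaviors with very different coverage.
suites = [
    BehaviorSuite("ambiguous_request",
                  scenarios=[Scenario("ambiguous_request", "set something up w/ Sam", "clarify")],
                  baseline={"accuracy": 0.78}),
    BehaviorSuite("off_topic",
                  scenarios=[Scenario("off_topic", "what's a good pizza place?", "decline")],
                  baseline={"accuracy": 0.95}),
]

for suite in suites:
    print(suite.behavior, "coverage:", suite.coverage(), "baseline:", suite.baseline)
```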
Synthetic Data Loop
Establishing repeatable evaluations in dev pipelines (CI/CD) is a high ROI activity. It is also becoming a standard in LLM app development. Evaluations require sufficient data and clear signals to guide improvement decisions based on metrics movement. Synthetic data expands visibility into model behaviors and, coupled with evaluations, makes these decisions more data-driven.
The diagram shows how synthetic data applies to various LLM app development stages. Not all AI teams leverage all stages. As teams start with ad-hoc testing and manual evaluations, there are natural opportunities to drive automation. Synthetic data connects and accelerates each stage, maximizing leverage and enabling the complete development loop.
Describe Model Behaviors - The loop starts with you identifying the key behaviors your app cares about. Synthetic scenarios are generated for each behavior in iterations until sample volume and quality are sufficient. As new features are introduced or gaps are identified, this step is repeated.
Iterate & Evaluate - With behaviors and baselines defined, there is a solid foundation for fast iterations on prompts, model context, and retrieval parameters. Evaluations here can be automated, driven by user feedback, or performed by humans. The target outcome is that all the quick improvements (e.g., retrieval model, prompt, LLM parameters) are identified and applied.
Distill Failures - Stubborn or intermittent failures will remain from Iterate & Evaluate and are ideal input for this stage. The synthetic generation process is used to better map the triggering conditions of these stubborn failures and to triage their impact. This is accomplished by categorizing failure types, synthetically generating adjacent input variations, and feeding them through the same evaluation criteria. This gives visibility into the persistent failure modes and expands the expected model behavior scenarios.
Fine-Tune Behavior - At the point when cheaper options (e.g., prompt tuning) have been exhausted, it is time to fine-tune the underlying model. Synthetic data is leveraged to generate fine-tuning instruction sets with positive and negative samples based on the distilled failures from the prior stage.
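To make the loop itself concrete, here is a rough skeleton of how the four stages could be wired together. Every function below is a hypothetical stub standing in for your own generators, evaluators, and tuning jobs, not a real library call:

```python
def describe_behaviors() -> list[str]:
    # Stage 1: the behaviors your app cares about (hypothetical examples).
    return ["off_topic", "ambiguous_request", "specific_scheduling"]

def generate_scenarios(behavior: str, n: int = 20) -> list[dict]:
    # Stage 1 (cont.): synthetic inputs per behavior; replace with your generator.
    return [{"behavior": behavior, "input": f"sample {i} for {behavior}"} for i in range(n)]

def evaluate(scenarios: list[dict]) -> list[dict]:
    # Stage 2: run the app and score outputs against your metrics; stubbed here.
    return [{**s, "passed": hash(s["input"]) % 3 != 0} for s in scenarios]

def distill_failures(results: list[dict]) -> list[dict]:
    # Stage 3: keep the stubborn failures for targeted expansion.
    return [r for r in results if not r["passed"]]

def fine_tune(failures: list[dict]) -> None:
    # Stage 4: convert distilled failures into instruction data and tune the model.
    print(f"would fine-tune on {len(failures)} curated failure examples")

# One pass through the loop; in practice this repeats as features and gaps evolve.
scenarios = [s for b in describe_behaviors() for s in generate_scenarios(b)]
failures = distill_failures(evaluate(scenarios))
fine_tune(failures)
```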
Let’s Take an Example
Let’s take an example of building an app while maintaining focus on how to leverage synthetic data. Here we'll outline how the above stages map at a high level; a follow-up blog will demonstrate how to implement them.
Let’s say we are building an agent-based collaboration app that includes a Meeting Scheduling Agent, a Meeting Summarization Agent, and a Personalized Doc Retrieval Agent. We are interested in building the user intent detection and routing module, which calls out to a model to decide which agent to route to and when to ask the user for more information if the intent is unclear. Let’s step through each stage:
Describe Model Behaviors - You can foresee the following scenarios that capture model behaviors you care about:
Out-of-scope, off-topic requests with appropriate guidance to the user.
Robustness-related scenarios covering jailbreaks, prompt leaks, and terse or cryptic input.
Ambiguous or borderline requests that require clarification with the user.
Highly specific scheduling or summarization requests that imply immediate handling by the relevant agent.
You can start with a few manually crafted examples or sample data from internal dogfooding and use synthetic generation to multiply them and create variations within each category. You focus the generation on a conversational, informal tone and layer in typical user input issues such as sentence fragments and misspellings. You could also create scenario templates and vary key parameters for the behaviors you care about.
It’s important to think of this stage as a synthetic data pipeline with a data quality evaluation step that filters out low-quality data points. Once the pipeline is set up, you can introduce new generation sources into it, e.g., A/B testing or production traffic.
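A minimal sketch of such a pipeline might look like the following, with a placeholder call_llm function standing in for whatever model API you actually use; the quality filter here is deliberately crude (length and duplicate checks) and in practice would be another evaluation step:

```python
SEEDS = {
    "ambiguous_request": ["can u set something up with the team?", "need that doc from last week"],
    "off_topic": ["whats the weather tmrw", "tell me a joke"],
}

def call_llm(prompt: str) -> str:
    # Placeholder for your actual model call; returns canned output so the sketch runs.
    return "hey can u sched a mtg w/ sam tmrw?\nneed 30 min with the design team"

def generate_variations(behavior: str, seed: str, n: int = 5) -> list[str]:
    prompt = (
        f"Write {n} informal chat messages a user might send, similar in intent to: '{seed}'. "
        "Keep a conversational tone and include occasional fragments or misspellings. "
        "Return one message per line."
    )
    return call_llm(prompt).splitlines()

def quality_filter(candidates: list[str], existing: set[str]) -> list[str]:
    kept = []
    for c in (c.strip() for c in candidates):
        if len(c) < 5 or c.lower() in existing:  # drop trivial inputs and duplicates
            continue
        existing.add(c.lower())
        kept.append(c)
    return kept

def build_scenarios() -> list[dict]:
    seen: set[str] = set()
    scenarios = []
    for behavior, seeds in SEEDS.items():
        for seed in seeds:
            for text in quality_filter(generate_variations(behavior, seed), seen):
                scenarios.append({"behavior": behavior, "input": text})
    return scenarios
```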
Iterate & Evaluate - Here it’s important to identify key metrics that will guide both development quality and product decision-making. For our example app, these could include:
Standard classification metrics (e.g., accuracy and F1 for selecting the right agent).
The number of dialog turns to a successful routing decision.
User sentiment.
Dialog tone and brevity.
It’s easy to see how these would be automated and become a standard part of your CI/CD process that runs through a set of behavior scenarios on a regular basis. This step needs to be fast, and automating it gains leverage across the whole system. With chosen metrics and behavior scenarios from the previous step, you can rapidly iterate on the prompt, model version, and other parameters. Baselines ensure you are making progress without fixing one part while breaking another.
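A CI regression gate could then be as simple as the sketch below: run the router over each behavior suite, compute accuracy, and fail the build if any suite drops below its baseline by more than a tolerance. The route_intent function and the baseline numbers are hypothetical placeholders:

```python
import sys

BASELINES = {"ambiguous_request": 0.78, "off_topic": 0.95, "specific_scheduling": 0.90}
TOLERANCE = 0.02  # allow small metric noise between runs

def route_intent(user_input: str) -> str:
    # Hypothetical stand-in for the real intent-detection / routing call.
    return "clarify"

def run_suite(scenarios: list[dict]) -> float:
    correct = sum(1 for s in scenarios if route_intent(s["input"]) == s["expected"])
    return correct / len(scenarios)

def ci_gate(suites: dict[str, list[dict]]) -> None:
    failed = []
    for behavior, scenarios in suites.items():
        accuracy = run_suite(scenarios)
        baseline = BASELINES[behavior]
        print(f"{behavior}: accuracy={accuracy:.2f} baseline={baseline:.2f}")
        if accuracy < baseline - TOLERANCE:
            failed.append(behavior)
    if failed:
        sys.exit(f"regression in: {', '.join(failed)}")  # non-zero exit fails the CI job
```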
Distill Failures - Based on the evaluation results from the previous stage, you can concentrate your efforts on persistent failures that couldn't be resolved. If you encounter persistent failures with ambiguous user requests and robustness scenarios, you can use specialized synthetic generators to create more variations and examples within each behavior-scenario group. All generated scenarios still pass through evaluations from the previous step to determine if they produce new failure examples. At this stage, synthetic generation would probe the boundaries of particular scenarios and effectively provide automated ‘red teaming’ of your app.
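Sketched in code, distillation is a targeted version of the same generation-plus-evaluation pipeline; categorize_failure, generate_adjacent, and evaluate_one below are hypothetical stand-ins for your own categorizer, generator, and evaluator:

```python
from collections import defaultdict

def categorize_failure(example: dict) -> str:
    # Hypothetical heuristic: bucket failures by the behavior they were meant to exercise.
    return example["behavior"]

def generate_adjacent(example: dict, n: int = 10) -> list[dict]:
    # Stand-in for synthetic generation that perturbs the failing input (tone, phrasing, length).
    return [{**example, "input": f"{example['input']} (variant {i})"} for i in range(n)]

def evaluate_one(example: dict) -> bool:
    # Stand-in for the same evaluation criteria used in Iterate & Evaluate.
    return len(example["input"]) % 2 == 0

def distill(failures: list[dict]) -> dict[str, list[dict]]:
    # Map each failure category to the adjacent variations that also fail,
    # giving a sharper picture of the triggering conditions.
    still_failing = defaultdict(list)
    for f in failures:
        for variant in generate_adjacent(f):
            if not evaluate_one(variant):
                still_failing[categorize_failure(f)].append(variant)
    return dict(still_failing)
```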
Fine-Tune Behavior - A critical aspect of getting this stage right is having high-quality instruction data to tune on. You could argue that all the previous stages have worked together to make this one much easier. Key behaviors have been identified, behavior gaps mapped with synthetic data samples, and data curated by passing through evaluations. What remains is converting this data into behavior-specific instruction sets.
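For example, curated inputs and their desired outputs could be serialized into a chat-style JSONL instruction set, one record per line. The exact schema depends on your tuning stack, so treat the format below as one common convention rather than a requirement:

```python
import json

def to_instruction_records(curated: list[dict]) -> list[dict]:
    # Each curated item pairs a user input with the desired routing behavior (positive sample);
    # negative samples can be added the same way with a corrected assistant turn.
    records = []
    for item in curated:
        records.append({
            "messages": [
                {"role": "system", "content": "Route the request to the right agent or ask a clarifying question."},
                {"role": "user", "content": item["input"]},
                {"role": "assistant", "content": item["expected"]},
            ]
        })
    return records

def write_jsonl(records: list[dict], path: str = "fine_tune.jsonl") -> None:
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")
```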
Pre-Production vs. Production
Interestingly, the same evaluations and metrics you set up to decide if your app is ready for production are also useful once the app is in production. There are no guarantees that the LLM outputs that were ‘passing’ before will continue to do so. Naturally, production will introduce new, unforeseen scenarios. Evaluations can be used offline against production data to identify new failure types and online to determine if the LLM output is of sufficient quality for use (e.g., displaying LLM-generated code to users).
In combination with evaluations, synthetic data helps here as well. Coverage is expanded via synthetic generation of new scenarios that include the new production failure modes, ensuring these behaviors are adequately covered in regular evaluations and CI/CD.
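The same idea can be sketched as two small hooks: an online quality gate before LLM output is shown to users, and an offline triage pass that promotes new production failures into the behavior scenario suites. Both functions below are hypothetical placeholders for your own checks and evaluators:

```python
def output_quality_ok(model_output: str) -> bool:
    # Hypothetical online gate: only surface the output if it clears a minimum bar
    # (e.g., parses, stays on topic, passes a lightweight judge model).
    return bool(model_output.strip())

def triage_production_logs(samples: list[dict], suites: dict[str, list[dict]]) -> None:
    # Offline: re-score logged traffic with the same pre-production evaluators,
    # then promote anything that fails into the behavior scenario suites so it is
    # covered by synthetic generation and regular CI runs going forward.
    for s in samples:
        if not output_quality_ok(s["output"]):
            suites.setdefault(s.get("behavior", "uncategorized"), []).append(
                {"behavior": s.get("behavior", "uncategorized"), "input": s["input"]}
            )
```

Either way, the loop stays closed: what surprises you in production becomes a synthetic scenario you evaluate against before the next release.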