Synthetic Data Generation for LLM Evaluation
Evaluation
Matt Wyman, CEO/Co-Founder
Sarah Barber, Senior Technical Writer
December 9, 2024
Synthetic data generation for LLMs is no longer just for data scientists. While synthetic data has long been important to data scientists for training their models, it's now becoming just as important to machine learning engineers and software engineers who work on LLM-powered apps.
If you plan to run your LLM products in production, you need to be confident in their abilities. LLM evaluation and fine-tuning can give you that confidence, and both require you to generate synthetic data.
This article explains what synthetic data generation is, why it matters to developers, and how Okareo (a tool for evaluating LLMs, RAG, and agents) can generate small amounts of synthetic data for specific use cases, including using it to bias your model in a specific direction.
What is synthetic data generation?
Synthetic data generation means creating artificial data that's similar to real data. It's useful when real-world data is limited, unavailable, or needs to be kept private.
While most discussions of synthetic data generation in AI tend to focus on the data scientist's use case of generating huge amounts of data to train models they built from scratch, at Okareo, we're more interested in how synthetic data can enable tasks like LLM evaluation and fine-tuning without needing to use live production data (containing potentially sensitive information).
This type of synthetic data generation tends to focus on generating smaller amounts of data for testing that your LLM is producing useful and accurate results. These tests need to be re-run whenever a change occurs to ensure that the behavior of your LLM hasn't regressed.
Synthetic data vs. mock data
Traditionally, data scientists have tended to use the term "synthetic data," whereas for software engineers, "mock data" has been a more common term. Both terms have been used to describe the generation of artificial data for testing purposes; however, there is an important distinction between the two.
The main difference is that synthetic data tends to closely mirror real-world data, whereas mock data is usually created to support specific test scenarios. Mock data may also include unrealistic or clearly fictional values such as "John Doe, 123 Example Street." The aim when producing synthetic data is to provide a realistic range of values (although it can cover edge cases if configured to do so).
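To make the distinction concrete, here's an illustrative TypeScript snippet (the records are hypothetical and purely for illustration): mock data is built around a specific test case and can be obviously fake, while synthetic data aims to look like something you'd plausibly find in production.
// Mock data: supports a specific test, values can be clearly fictional.
const mockCustomer = { name: "John Doe", address: "123 Example Street" };

// Synthetic data: artificial, but realistic enough to stand in for production data.
const syntheticCustomer = { name: "Priya Raman", address: "1427 Birchwood Lane, Austin, TX 78704" };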
Another difference is the way that each type of data is generated. Synthetic data tends to be generated in more complex ways — using algorithms or AI — whereas mock data tends to be created manually or using simple scripts.
The importance of synthetic data generation for LLM app testing
Testing your LLM app involves checking that the LLM produces accurate, relevant, and consistent responses, including when it's presented with unusual input or edge cases — such as ambiguous questions, unusual phrasing, very formal or informal tones, intentionally misleading questions, off-topic questions, mixed-emotion queries, or queries containing misspellings or contractions. If it can't, you'll need to do prompt tuning or fine-tune your model.
In order to test each specific type of use case or edge case, it's better to generate smaller sets of very specific data than the large amounts that data scientists use to originally train their model. OpenAI recommends generating 50 to 100 high-quality synthetic data examples for fine-tuning; however, sometimes even smaller amounts will do if you're checking something highly specific or trying to bias your model in a very specific direction.
Let's say you began by checking seventy-five different scenarios and ten of them failed. A good rule of thumb is to generate five synthetic data examples per failure. For each of these groups of five, if three or more are still failing, then this is probably indicative of a systemic failure rather than an outlier, and it means you need to do more prompt tuning or fine-tuning.
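As a rough illustration, this rule of thumb can be expressed in a few lines of TypeScript (the names here are made up for the example and are not part of any SDK):
// Rule of thumb: generate 5 variants per failed scenario; if 3 or more of those
// variants also fail, treat the failure as systemic rather than an outlier.
const VARIANTS_PER_FAILURE = 5;
const SYSTEMIC_THRESHOLD = 3;

function isSystemicFailure(failedVariants: number): boolean {
    return failedVariants >= SYSTEMIC_THRESHOLD;
}

// Example: 10 of 75 scenarios failed; after generating 5 variants for each and
// re-evaluating, count how many of those failures look systemic.
const failedVariantsPerScenario = [1, 4, 0, 5, 2, 3, 1, 0, 4, 2];
const systemicCount = failedVariantsPerScenario.filter(isSystemicFailure).length;
console.log(`${systemicCount} of 10 failures look systemic; more prompt tuning or fine-tuning is needed.`);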
Generating this type of synthetic data for very specific test scenarios is useful when testing LLM apps. Some examples are below:
Example 1: Meeting summarization assistant
You could test an app that summarizes meetings by generating different types of meeting notes (one set for an office team meeting, another for a town hall council meeting, and some with typos or people talking off topic) and seeing how well they're summarized. To save time, you can also generate the expected output; however, you may wish to manually verify that the summaries are good after generating them.
Some synthetic data mimicking notes from a town hall council meeting, generated for test purposes.
Example 2: Product recommendation app
A product recommendation app that takes a description of a product like "A waterproof smartwatch with GPS for running, under $250" and returns a suggested product is a great candidate for generating synthetic data. You could start by generating different testing scenarios to find out how the app handles edge cases that are not typically available in real-world data — such as "waterproof but not GPS-enabled" (most products typically combine these features) or "smartwatch for under $10" (there are unlikely to be any products available this cheap).
You might also want to generate data to test ambiguous or ill-defined queries like "durable watch for outdoor activities." For each test scenario, you can define the expected output to be a "perfect" example of a response, giving you a ground truth for comparison.
Example 3: Car service assistant
A car service app that takes queries like "My Toyota Corolla makes a squealing noise when I press the brakes" and produces output like "Front brake pads (for example, Brembo Ceramic Pads for Toyota Corolla 2016–2019)" or "Brake rotor resurfacing service" could use synthetically generated data to rephrase the same question in all the different ways that users might phrase it. For example, the following phrases could all refer to the same thing:
"Brakes squeal when stopping."
"Weird sound from the brakes."
"My car screeches when I hit the brake pedal."
You could also generate more complex scenarios to check how your system responds. For example, what happens if a user asks multiple questions at once? An example of this is "I need new tires for a 2015 Subaru Outback, but I also want to know how often to rotate them."
For all use cases, you need to ensure that the LLM responds appropriately, is consistent with its responses, and understands nuance.
Okareo's fast synthetic data generation tool
Okareo is a platform for evaluating LLMs and other machine learning models. To use Okareo, you create test scenarios (input data sometimes paired with expected outputs) and checks (metrics by which you want to evaluate the LLM, such as coherence, relevance, consistency, or more nebulous concepts like "how friendly the LLM's response was"). You then register your LLM with Okareo and run an evaluation on it using the scenarios and checks.
Generating synthetic data scenarios in-app
To generate synthetic data scenarios in Okareo, you can use the Synthetic Scenario Copilot. Describe the type of data you'd like generated, such as "Generate a five-sentence town hall meeting transcription about parking in East Anaheim and a resulting summary that is no longer than one sentence." Next, choose the number of rows of data you want, and Okareo will generate a bunch of scenarios on that theme. The app provides suggestions for the types of descriptions that work well for testing classification models, the retrieval part of RAG systems, or generative models, which you can then modify for your own purposes.
You can use the same feature to generate different versions of the same scenarios, such as rephrased or misspelled inputs, or those that use contractions. Check the boxes of the rows you want to change, select from the options that appear below the table (such as "Rephrasing," "Misspellings" etc.), and then click the button at the bottom right to modify your scenarios. Finally, don't forget to save your newly generated scenario versions so they can be passed into an LLM evaluation.
Generating synthetic data in your code
Okareo also has TypeScript and Python SDKs for running LLM evaluations in code. To run this code locally, you'll need to download the Okareo CLI and follow the instructions to export any relevant environment variables and initialize an Okareo project.
This example assumes you're using the TypeScript SDK. The full working code for this example is available on our GitHub. We'll be using the product recommendation example from above.
Start by creating an initial seed scenario, which will become the basis from which you can generate variant scenarios — like rephrased or misspelled versions of the same queries. Add some test cases that include a sample user query and its corresponding expected result (or best example of a suitable response).
// Imports are assumed from the okareo-ts-sdk package; adjust names to match your SDK version.
import { Okareo, SeedData, ScenarioType, TestRunType, RunTestProps, OpenAIModel, GenerationReporter, components } from "okareo-ts-sdk";

// OKAREO_API_KEY, project_id, SCENARIO_SET_NAME, and UNIQUE_BUILD_ID are assumed to be
// defined elsewhere (see the full example on GitHub).
const okareo = new Okareo({ api_key: OKAREO_API_KEY });

// Each seed pairs a realistic user query with a "gold standard" expected response.
const INITIAL_SEED_DATA = [
    SeedData({
        input: "I want a waterproof smartwatch with GPS for running, under $250",
        result: "The Garmin Forerunner 45 ($180) is a great option, offering GPS tracking, waterproofing, and running-specific features. Alternatively, the Amazfit Bip U Pro ($69) provides similar functionality at a lower price."
    }),
    SeedData({
        input: "I need noise-canceling headphones for under $150",
        result: "The Sony WH-CH710N ($130) offers effective noise cancellation, up to 35 hours of battery life, and a comfortable design. For a more compact option, consider the Anker Soundcore Life Q30 ($79) with hybrid noise cancellation."
    }),
    SeedData({
        input: "I’m looking for an eco-friendly yoga mat that’s not too expensive.",
        result: "The Gaiam Cork Yoga Mat ($49) is a sustainable option made with cork and TPE. Another great choice is the Liforme Travel Mat ($120), which uses biodegradable materials and offers excellent grip."
    }),
    SeedData({
        input: "What’s a good beginner’s acoustic guitar for under $200?",
        result: "The Yamaha FG800 ($199) is a highly recommended beginner guitar with excellent tone and durability. If you’re looking for something smaller, the Fender FA-15 ($149) is a great option."
    })
];

// Create the seed scenario set that variant scenarios will be generated from.
const seed_scenario: any = await okareo.create_scenario_set({
    name: `${SCENARIO_SET_NAME} Scenario Set - ${UNIQUE_BUILD_ID}`,
    project_id: project_id,
    seed_data: INITIAL_SEED_DATA
});
From this, you can generate a synthetic scenario set of similar queries. The example below will generate five different misspelled versions for each of the four existing scenarios. To generate rephrased versions instead of misspelled ones, the generation type would need to be ScenarioType.REPHRASE_INVARIANT. The different scenario types that Okareo can generate are listed here.
// Generate five misspelled variants of each row in the seed scenario set.
const misspelled_scenario: any = await okareo.generate_scenario_set({
    project_id: project_id,
    name: `${SCENARIO_SET_NAME} Misspelled Scenario Set - ${UNIQUE_BUILD_ID}`,
    source_scenario_id: seed_scenario.scenario_id,
    number_examples: 5,
    generation_type: ScenarioType.COMMON_MISSPELLINGS,
});
At this stage you can view your newly generated scenarios in the Okareo app. You can see that the existing four seed scenarios are included in the new scenario set, which means it's possible to use the techniques above to chain together different types of generated scenarios and then run your evaluation on a huge set of scenarios with different variants.
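As a minimal sketch of this chaining (reusing the same generate_scenario_set call shown above, with an illustrative scenario name and row count), you could feed the misspelled scenario set back in as the source and produce rephrased variants of those rows:
// Chain generators: use the generated misspelled set as the new source scenario.
const rephrased_scenario: any = await okareo.generate_scenario_set({
    project_id: project_id,
    name: `${SCENARIO_SET_NAME} Rephrased Misspelled Scenario Set - ${UNIQUE_BUILD_ID}`,
    source_scenario_id: misspelled_scenario.scenario_id,
    number_examples: 2, // illustrative count
    generation_type: ScenarioType.REPHRASE_INVARIANT,
});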
Next, you need to register your model with Okareo, passing in any templates or system prompts.
const USER_PROMPT_TEMPLATE = "{scenario_input}"
const RECOMMENDATION_CONTEXT_TEMPLATE = "You are an intelligent product recommendation assistant. Your job is to recommend products based on user queries. The user will describe their needs or preferences, and you will suggest 1-3 suitable products. Each recommendation should include the product name, a brief description, key features that match the query, and the price (if provided or relevant). If no exact match is available, suggest the closest alternatives that fulfill most of the user's requirements."
const model = await okareo.register_model({
    name: MODEL_NAME,
    tags: ["TAG_NAME"],
    project_id: project_id,
    models: {
        type: "openai",
        model_id: "gpt-3.5-turbo",
        temperature: 0.5,
        system_prompt_template: RECOMMENDATION_CONTEXT_TEMPLATE,
        user_prompt_template: USER_PROMPT_TEMPLATE,
    } as OpenAIModel,
    update: true,
});
Then call the run_test function, which runs the evaluation, passing in a series of checks. In this example, we're using standard pre-baked Okareo checks, but it's also possible to define your own.
// Run the evaluation against the misspelled scenario set, scoring each response with the listed checks.
const eval_run: components["schemas"]["TestRunItem"] = await model.run_test({
    model_api_key: OPENAI_API_KEY,
    name: `${MODEL_NAME} Eval ${UNIQUE_BUILD_ID}`,
    tags: [`Build:${UNIQUE_BUILD_ID}`],
    project_id: project_id,
    scenario: misspelled_scenario,
    calculate_metrics: true,
    type: TestRunType.NL_GENERATION,
    checks: [
        "coherence",
        "consistency",
        "fluency",
        "relevance"
    ]
} as RunTestProps);
It's also worth setting up a GenerationReporter object, which will report on which metrics pass or fail when your code is run. You can set minimum thresholds for each metric you're testing to ensure behavior doesn't regress every time you evaluate the LLM, and the reporter can announce whether your evaluation passed or failed according to each metric.
const report_definition = {
    metrics_min: {
        "coherence": 4.8,
        "consistency": 4.8,
        "fluency": 4.8,
        "relevance": 4.8,
    }
};

const reporter = new GenerationReporter({
    eval_run: eval_run,
    ...report_definition,
});
reporter.log();
Finally, you can run your code with the okareo run command, or you can make your Okareo evaluations part of a test suite such as Jest and run them along with other app tests. The results of your evaluation can then be viewed in the Okareo app, or on the command line:
The GenerationReporter showing which metrics did not pass and why for a failed evaluation run
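If you go the Jest route, a minimal sketch might look like the following. It assumes the evaluation code above is wrapped in a helper (runRecommendationEval is a hypothetical name) that returns the GenerationReporter, and that the reporter exposes a boolean pass flag reflecting the metrics_min thresholds; adjust the import and property names to your SDK version.
import { GenerationReporter } from "okareo-ts-sdk";

// Hypothetical wrapper around the scenario generation, model registration, and
// run_test calls shown above; it resolves to the configured GenerationReporter.
declare function runRecommendationEval(): Promise<GenerationReporter>;

describe("product recommendation LLM", () => {
    test("meets minimum generation metrics", async () => {
        const reporter = await runRecommendationEval();
        reporter.log();
        // Assumes the reporter exposes a pass flag once thresholds are evaluated.
        expect(reporter.pass).toBeTruthy();
    }, 120_000); // LLM evaluations can be slow, so raise Jest's default timeout
});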
Viewing the results in the Okareo app gives you an overview of all the information available. The larger scores for relevance, fluency, and so on near the graphs at the top are averages across the scenario set. You can also see these scores per row. In the screenshot below, the third row is selected; it has high scores for relevance, fluency, and coherence. You can see that the actual result is good quality, and you can click on the expected result to compare the two.
The results of an evaluation in the Okareo app
If your results aren't high enough, you may need to do some fine-tuning or consider how changes to the prompt might improve the results.
Use Okareo to generate synthetic data for your LLM evaluations
As an LLM app developer, you need to be sure that your system works as expected and can handle unusual inputs. Synthetic data generation helps by producing a wide range of scenarios and edge cases, which you can then use to evaluate your LLM comprehensively. You can then choose to tune your internal system prompt or fine-tune your model if needed.
To start evaluating your LLM and easily generating test data for scenarios, try Okareo today.