Using Synthetic Data to Improve Test Coverage and Performance
Agentics
Stefan Broecker, Machine Learning Engineer
August 2, 2024
Machine learning practitioners often face a common problem when training and evaluating a model: not enough high-quality, labeled data. During training, a limited dataset can lead to a brittle model, one that behaves unexpectedly when presented with novel data. During evaluation, a limited dataset means that you won’t catch that unexpected behavior until it’s too late. This post makes the case that synthetic data can help alleviate both of those problems, improving both model training and evaluation. By walking through a simple use case, we’ll show what we mean by synthetic data and how it can help build a more robust model.
Training a model
To start, the model we’re going to be training and evaluating is an intent classification model. Intent classification models are often used to route customer questions to appropriate customer support departments. For example, the question “Can I return online purchases to a physical store location?” might be directed to the Returns department. Our model will be classifying questions into three categories: returns, complaints, and pricing.
The model itself will be a BERT model, and we’ll be training it on a small sample of human-generated and human-labeled customer support questions, randomly split into training and testing sets. Full details of training are in this repository.
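To make the random split concrete, here is a minimal sketch in plain Python. The example questions, the `train_test_split` helper, and the split fraction are all assumptions for illustration; the actual data and training code live in the repository mentioned above.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split them into (train, test) lists."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled questions (question, intent); not the post's dataset.
examples = [
    ("Can I return online purchases to a physical store location?", "returns"),
    ("How long do I have to send an item back?", "returns"),
    ("My order arrived damaged and nobody has responded.", "complaints"),
    ("The support agent was rude to me on the phone.", "complaints"),
    ("How much does express shipping cost?", "pricing"),
    ("Do you offer a student discount?", "pricing"),
]

train, test = train_test_split(examples, test_fraction=0.33, seed=42)
print(len(train), "training examples,", len(test), "testing examples")
```

Fixing the seed keeps the split stable between runs, which matters later when we compare two models on the same test set.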
Without too much work, we were able to create an intent classification model, which we’ll call our Base model, that’s more than 85% accurate on our testing data. Pretty good!
Synthetic Data for Evaluation
But wait, our test set only has about 20 samples. How representative can that really be? Let’s generate some synthetic data to find out.
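One way to put a number on that doubt is a confidence interval for the measured accuracy. The sketch below computes a Wilson score interval; the counts (17 correct out of 20) are assumed purely for illustration and are not the exact figures from our run.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% confidence."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width

# Assumed counts for illustration: 17 of 20 test questions classified correctly.
low, high = wilson_interval(correct=17, total=20)
print(f"accuracy is plausibly between {low:.0%} and {high:.0%}")
```

With these assumed counts the interval runs from roughly 64% to 95%, which is a wide range to hang a "pretty good" on.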
To do that, we’re going to use some of Okareo’s generators (you can find those here). We’ll start with the Rephrasing generator. This generator takes an example piece of text and rewords it while keeping the same meaning. If we use our question about physical stores as an example, the Rephrasing generator might create “Is it possible to take online purchases back to a brick-and-mortar store?”
Next, we’ll use the Conditional generator, which rewords a question to emphasize a particular clause. Continuing our example, the Conditional generator might produce “When dealing with returns for online purchases, are physical store locations an option?”
Then we’ll generate some harder synthetic data by creating examples with the Misspellings generator, which intentionally misspells words. This might produce something like “Can I return online purchases to a phyaical store location?”
And finally we’ll use the Contractions generator to shorten words in a (hopefully) human-like way. An example of that might be “Can I return nline purchses to a physical store location?”
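To give a feel for what the last two kinds of perturbation do, here is a toy sketch in plain Python. It is not how Okareo's generators work: `misspell` and `shorten` are hypothetical stand-ins that swap two adjacent letters or drop one interior vowel, which is far cruder than a real generator.

```python
import random

VOWELS = set("aeiouAEIOU")

def misspell(text, rng):
    """Swap two adjacent, different interior letters in one word of 4+ letters."""
    words = text.split()
    swaps = [
        (i, j)
        for i, w in enumerate(words)
        if w.isalpha() and len(w) >= 4
        for j in range(1, len(w) - 2)      # first and last letters stay in place
        if w[j] != w[j + 1]
    ]
    if not swaps:
        return text
    i, j = rng.choice(swaps)
    w = words[i]
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def shorten(text, rng):
    """Drop one interior vowel from one word of 5+ letters."""
    words = text.split()
    drops = [
        (i, j)
        for i, w in enumerate(words)
        if w.isalpha() and len(w) >= 5
        for j in range(1, len(w) - 1)
        if w[j] in VOWELS
    ]
    if not drops:
        return text
    i, j = rng.choice(drops)
    w = words[i]
    words[i] = w[:j] + w[j + 1:]
    return " ".join(words)

question = "Can I return online purchases to a physical store location?"
rng = random.Random(7)
print(misspell(question, rng))
print(shorten(question, rng))
```

Both functions leave the intent of the question untouched, so the original label can be reused for the perturbed text.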
We’ll use the generators to make three new examples out of each of our original testing examples, and then test our model on the new data.
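The expansion step can be sketched as a small loop. Everything here is an assumption for illustration: `drop_char` and `double_char` are trivial stand-in generators, `expand` is a hypothetical helper, and we read "three new examples" as three per generator. The key point is that each variant inherits the label of the question it came from.

```python
import random

def drop_char(text, rng):
    """Stand-in generator: delete one randomly chosen character."""
    k = rng.randrange(len(text))
    return text[:k] + text[k + 1:]

def double_char(text, rng):
    """Stand-in generator: duplicate one randomly chosen character."""
    k = rng.randrange(len(text))
    return text[:k] + text[k] + text[k:]

def expand(examples, generators, per_example=3, seed=0):
    """Return {generator_name: [(variant, label), ...]} with labels carried over."""
    rng = random.Random(seed)
    scenarios = {}
    for name, generate in generators.items():
        scenarios[name] = [
            (generate(text, rng), label)
            for text, label in examples
            for _ in range(per_example)
        ]
    return scenarios

# Hypothetical test questions; not the post's dataset.
test_examples = [
    ("Can I return online purchases to a physical store location?", "returns"),
    ("How much does express shipping cost?", "pricing"),
]
scenarios = expand(test_examples, {"drop": drop_char, "double": double_char})
for name, rows in scenarios.items():
    print(name, len(rows), "examples")
```

Keeping one list per generator, rather than one merged pool, is what lets us report accuracy separately for each kind of variation.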
Look at that! On rephrased and conditional questions, the Base model is even better than we thought, achieving almost 94% accuracy.
The problems appear when we start looking at questions that fall a little farther outside of our test set. If a user misspells words or uses contractions, the Base model can’t classify their question as reliably. The Base model only reaches about 80% accuracy on misspelled questions and about 76% accuracy when contractions are added.
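A per-scenario breakdown like this one takes very little code. The sketch below uses a hypothetical keyword classifier (`keyword_model`) and a handful of made-up questions in place of the BERT model and our data, so the numbers it prints are toy values, not the results reported above.

```python
def accuracy(predict, examples):
    """Fraction of (text, label) pairs for which predict(text) equals label."""
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)

def keyword_model(text):
    """Hypothetical stand-in classifier; deliberately brittle to typos."""
    lowered = text.lower()
    if "return" in lowered:
        return "returns"
    if "cost" in lowered or "price" in lowered:
        return "pricing"
    return "complaints"

# Made-up evaluation scenarios, one list per kind of variation.
scenarios = {
    "original": [
        ("Can I return online purchases to a physical store location?", "returns"),
        ("How much does shipping cost?", "pricing"),
        ("My package arrived broken and I am upset.", "complaints"),
    ],
    "misspelled": [
        ("Can I retrun online purchases to a physical store location?", "returns"),
        ("How much does shiping cost?", "pricing"),
        ("My pakage arrived broken and I am upset.", "complaints"),
    ],
}

report = {name: accuracy(keyword_model, rows) for name, rows in scenarios.items()}
for name, score in report.items():
    print(f"{name}: {score:.0%}")
```

Even in this toy setup, a single transposed letter in "return" is enough to send a returns question to the wrong department, which is the kind of failure an aggregate score on clean data hides.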
And there we go! With just a few variations of our original test data, we learned more about the performance of our Base model. We now know that its performance suffers if a question uses irregular language, something we couldn’t have known from the original test data. The fact that we were able to pinpoint some areas that our Base model does poorly on is already a testament to the value of synthetic data.
Synthetic Data for Training
Being aware of the shortcomings of our model is one thing. Can we do anything about it? Glad you asked.
Next we’re going to look at how we can use very similar synthetic generation techniques to build a more robust model.
We’ll use the same four generators as before, but this time we’ll generate new training data instead of new testing data. Then, we’ll train a new model using our original training data and our new synthetic training data. We’ll call this the Synthetic model. Full details of training the Synthetic model are here.
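Combining the two sources of training data can be sketched as follows. The `build_training_set` helper and its duplicate and leakage checks are our own assumptions about sensible precautions, not a description of the linked training code: synthetic rows that merely repeat an existing question, or that happen to match a held-out test question, are dropped.

```python
def normalize(text):
    """Lowercase and collapse whitespace so near-identical rows compare equal."""
    return " ".join(text.lower().split())

def build_training_set(original, synthetic, held_out):
    """Merge original and synthetic rows, skipping duplicates and test-set matches."""
    blocked = {normalize(text) for text, _ in held_out}
    seen = set()
    merged = []
    for text, label in list(original) + list(synthetic):
        key = normalize(text)
        if key in blocked or key in seen:
            continue
        seen.add(key)
        merged.append((text, label))
    return merged

# Hypothetical rows for illustration; not the post's dataset.
original = [("How much does shipping cost?", "pricing")]
synthetic = [
    ("how much does  shipping cost?", "pricing"),   # duplicate after normalizing
    ("How much is shipping?", "pricing"),           # genuinely new wording
    ("Do you price match?", "pricing"),             # collides with the test set
]
held_out = [("Do you price match?", "pricing")]

training_set = build_training_set(original, synthetic, held_out)
print(len(training_set), "training rows")
```

Generating the synthetic rows only from training questions, and filtering against the test set as above, keeps the later comparison on human-generated test data honest.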
Now, let’s compare how the Synthetic model stacks up against the Base model. We’ll start by comparing their performance on the synthetic testing data we generated earlier. First, let’s look at the conditional statements, which the Base model was already good at classifying.
The Synthetic model is even better! Now let’s look at the rephrased questions.
Again, even better! But what about the synthetic data that the Base model wasn’t as good at? Here’s the data with contractions.
Another improvement! And here’s the data with misspellings.
A huge improvement!
But what about the data we really care about? How did the Synthetic model do on the human-generated test set?
Almost a 10-percentage-point improvement in accuracy!
Wrapping Up
As we’ve seen, synthetic data can be a powerful tool for both evaluating and training a model. Without collecting new human-generated data, which can be both time-consuming and costly, we were able to expand the scope of our evaluation and train a new model that addressed the shortcomings our evaluation uncovered.