Bootstrapping and improving LLM fine-tuning

Fine Tuning

Mason del Rosario, Founding Machine Learning Engineer

June 18, 2024

Language models behave in unpredictable ways. This can be charming if you’re messing around with ChatGPT, but in high-stakes situations, an LLM that goes off script can be reputationally or materially harmful. To avoid such gaffes, we want to ensure that models deployed in production behave predictably and factually. In this post, we show you how to use instruction fine-tuning to improve your LLM's performance and how Okareo can accelerate your fine-tuning workflow.

Why fine-tune when you can prompt?

The first approach to improving a model’s predictability and accuracy is prompt engineering: iterating between tweaking the model’s system prompt and assessing the LLM’s performance on a fixed, baseline scenario. Prompt engineering in this fashion is often sufficient, but it tends to produce large, cumbersome prompts that consume many tokens and eat into the LLM’s context window.

Enter fine-tuning: a form of supervised training that uses well-labeled examples to tune an LLM toward specific behaviors. The key idea is that instead of exhaustively describing the desired behavior (à la prompt engineering), we teach the LLM how to behave by example. In addition to better performance, this method can yield much smaller system prompts that consume a relatively small portion of the context window.

As a form of supervised learning, fine-tuning relies on the availability of an instruction set, which consists of curated input/output pairs preceded by a task description. In practice, getting access to such data can be time-consuming and expensive (e.g., hiring human annotators to sift through production data). Instead of waiting on production data and labelers to start fine-tuning, we can leverage synthetic data for our initial instruction set to bootstrap the data collection process.

In this blog post, we will demonstrate how you can use Okareo at all stages of fine-tuning, from bootstrapping your instruction set, to evaluating your fine-tuned model’s performance, and finally to augmenting your instruction set in a systematic way to improve your fine-tuned model’s performance.

Case Study: Fine-tuning Phi-3 for Intent Detection

Key Ideas: Okareo synthetic generation and evaluations; Train/test splits; Parameter-efficient fine-tuning; Data augmentation

This guide is an outline of our fine-tuning example notebooks (Part 1 and Part 2). To follow along, sign up for Okareo and get your API token now!

To frame our fine-tuning workflow, let’s consider a RAG architecture for answering questions about a hypothetical online retailer, WebBizz. To help route user questions to the proper database, we will fine-tune an intent detection model to assign the right label to the question.

To fine-tune our intent detection LLM, we take the following approach:

  1. Generate user questions based on WebBizz articles using the Okareo reverse question generator

  2. Split the synthetic questions into train/test splits

  3. Format the train split as an instruction set

  4. Fine-tune Phi-3 on the train split

  5. Evaluate the fine-tuned LLM on the train/test splits in an Okareo classification evaluation

  6. Construct an augmented train split by using the Okareo rephrasing generator on misclassified rows

  7. Fine-tune Phi-3 on the augmented train split

  8. Compare the augmented fine-tuned LLM’s train/test performance to the original fine-tuned model

1. Generate User Questions

Our RAG system will have a database of WebBizz articles to use as context for generated answers. Before making fine-tuning instructions, we need representative questions that a WebBizz customer might ask.

To get a set of such questions, we will use the reverse question generator in Okareo. The basic idea is that, given a set of articles, the generator produces one or more relevant questions that each article answers.

Example of synthetic questions generated on a sustainability-related WebBizz article.
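
Below is a rough sketch of this generation step with the Okareo Python SDK. Treat the method and enum names here (generate_scenarios, ScenarioType.TEXT_REVERSE_QUESTION) as assumptions; the example notebooks linked above show the exact calls.

```python
from okareo import Okareo
from okareo_api_client.models import ScenarioSetCreate, SeedData, ScenarioType

okareo = Okareo("YOUR_API_TOKEN")

# Upload WebBizz articles as a seed scenario: each seed pairs an article
# (input) with its intent label (result).
articles = okareo.create_scenario_set(
    ScenarioSetCreate(
        name="WebBizz Articles",
        seed_data=[
            SeedData(input_="We are committed to sustainable packaging ...", result="sustainability"),
            SeedData(input_="You can track your order from your account page ...", result="orders"),
        ],
    )
)

# Generate questions that each article answers. NOTE: the generation-type
# name below is an assumption; check the Okareo docs for the exact
# reverse question generator identifier.
questions = okareo.generate_scenarios(
    source_scenario=articles.scenario_id,
    name="WebBizz Synthetic Questions",
    number_examples=3,
    generation_type=ScenarioType.TEXT_REVERSE_QUESTION,
)
```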

2. Make Train/Test Splits

In conventional ML development, we have the notion of a train split and a test split. The train split is used to calculate the model’s loss, and this loss is used during backpropagation to update the weights of the model. The test split is used to evaluate the model’s loss on data that the model has not “seen” during training. Holding out a test split allows us to have more confidence in the model’s performance on new data that the trained model has not seen (i.e., production data).

We can apply the same principle of train/test splits to LLM fine-tuning. In practice, we use scikit-learn’s StratifiedShuffleSplit to ensure that the intent class distribution is consistent between the two splits.

Example of splitting synthetic questions into train and test.
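
A minimal sketch of the split, assuming the synthetic questions and their intent labels are held in two parallel lists:

```python
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-ins for the synthetic questions and their intent labels.
questions = [
    "How do I track my order?", "Where is my package?",
    "Can I change my shipping address?", "When will my order arrive?",
    "Do you use recyclable packaging?", "What is your sustainability policy?",
    "Are your products ethically sourced?", "Do you offset carbon emissions?",
]
intents = ["orders"] * 4 + ["sustainability"] * 4

# One stratified split: 75% train / 25% test, with the same class balance in both.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(questions, intents))

train = [(questions[i], intents[i]) for i in train_idx]
test = [(questions[i], intents[i]) for i in test_idx]
```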

3. Format the train split as an instruction set

Now that we have our splits, we can format our questions as fine-tuning instructions. Generally speaking, instructions should be formatted with the following three fields:

  • Instruction: Description of the task, input/output format, etc. (here, intent detection)

  • Input: Text used to prompt the LLM (here, the user question)

  • Output: Expected response from the LLM (here, the predicted intent)

For our intent detection task, we adopt the instruction template pictured below.
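
The sketch below illustrates the general shape of such a template; the intent label set and field layout here are placeholders, not the exact template from the notebook.

```python
# Hypothetical intent labels for illustration; the notebook defines the real set.
INTENTS = ["orders", "returns", "products", "membership", "sustainability"]

INSTRUCTION = (
    "You are an intent classifier for WebBizz customer questions. "
    f"Classify the question into exactly one of: {', '.join(INTENTS)}. "
    "Respond with the label only."
)

def format_example(question: str, intent: str) -> str:
    """Render one train-split row as a single fine-tuning instruction."""
    return (
        f"### Instruction:\n{INSTRUCTION}\n\n"
        f"### Input:\n{question}\n\n"
        f"### Output:\n{intent}"
    )

print(format_example("Do you ship internationally?", "orders"))
```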

4. Fine-tune Phi-3 on the formatted train split

We will fine-tune Phi-3-mini-4k-instruct, a 3.8B-parameter open-source LLM from Microsoft, to perform intent detection on our user questions. To perform the fine-tuning, we provisioned a GCP VM with a single NVIDIA L4 GPU.

To fine-tune the model in Python, we started with the excellent tutorial by Philipp Schmid of Hugging Face. He demonstrates how to fine-tune Llama 2 with LoRA (a parameter-efficient fine-tuning technique), 4-bit quantization, and flash attention, which together help the LLM fit in GPU memory.
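
A condensed sketch of that recipe, assuming the instruction-formatted train split has been written to a JSONL file with a "text" field; argument names vary across trl versions, so treat this as an outline rather than the exact notebook code.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

model_id = "microsoft/Phi-3-mini-4k-instruct"

# Load the base model in 4-bit so it fits comfortably on a single L4 GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA adapters: only a small set of low-rank matrices gets trained.
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules="all-linear", task_type="CAUSAL_LM",
)

# "train_instructions.jsonl" is an assumed filename for the formatted train split.
train_dataset = load_dataset("json", data_files="train_instructions.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,
    args=TrainingArguments(
        output_dir="phi3-intent",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        bf16=True,
    ),
)
trainer.train()
```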

5. Evaluate the fine-tuned LLM in Okareo

After fine-tuning Phi-3, we used Okareo to perform a classification evaluation. In brief, this process involved the steps below (a code sketch follows the list):

  • Uploading the train/test splits as scenarios

  • Uploading the fine-tuned Phi-3 as a CustomModel

  • Starting test runs of the CustomModel on each scenario
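
A minimal sketch of this flow with the Okareo Python SDK, following the CustomModel pattern from the Okareo docs; the specific class and enum names (ModelInvocation, TestRunType.MULTI_CLASS_CLASSIFICATION) and the local classify_intent helper are assumptions to verify against the example notebooks.

```python
from okareo import Okareo
from okareo.model_under_test import CustomModel, ModelInvocation
from okareo_api_client.models.test_run_type import TestRunType

okareo = Okareo("YOUR_API_TOKEN")

class FineTunedPhi3(CustomModel):
    """Wraps the fine-tuned Phi-3 so Okareo can call it row by row."""

    def invoke(self, input_value: str) -> ModelInvocation:
        intent = classify_intent(input_value)  # hypothetical local inference helper
        return ModelInvocation(
            model_prediction=intent,
            model_input=input_value,
            model_output_metadata={"raw": intent},
        )

mut = okareo.register_model(
    name="phi3-intent-finetuned",
    model=FineTunedPhi3(name="phi3-intent-finetuned"),
)

# One classification test run per scenario (repeat for the train and test splits).
evaluation = mut.run_test(
    scenario=test_scenario,  # scenario uploaded earlier
    name="phi3-intent test split",
    test_run_type=TestRunType.MULTI_CLASS_CLASSIFICATION,
)
print(evaluation.app_link)
```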

On the “Published” tab of Okareo, we can compare different evaluation runs against each other. The figure below shows key classification metrics for the fine-tuned model as reported by Okareo.

Okareo Model Cards showing fine-tuned Phi-3 performance on the WebBizz train/test splits.

We can also click into one of these evaluations to get more granular information about the evaluation run, like the confusion matrix.

Okareo Evaluation showing topline metrics and the confusion matrix of Phi-3 on the test split.

6. Augment the train split with misclassified rows

Given our fixed train split, how can we quickly marshal more data to improve our fine-tuned model’s performance? Here, we use the results of our classification evaluation to guide the process (sketched in code below). More specifically, we will do the following:

  1. Filter the train split based on incorrectly classified questions

  2. Use the rephrasing generator on these questions

  3. Make an “augmented” train split with the rephrased questions

Strategy for augmenting training data for our fine-tuning workflow
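
A sketch of the augmentation step, assuming you have collected the evaluation's per-row predictions into a list of dicts; the rows structure and the rephrase generation-type name are assumptions to check against the Okareo docs.

```python
from okareo import Okareo
from okareo_api_client.models import ScenarioSetCreate, SeedData, ScenarioType

okareo = Okareo("YOUR_API_TOKEN")

# rows: list of dicts with the question, expected intent, and the model's
# prediction from the classification evaluation (hypothetical structure).
misclassified = [r for r in rows if r["prediction"] != r["expected"]]

# Upload just the misclassified questions as a new seed scenario.
hard_scenario = okareo.create_scenario_set(
    ScenarioSetCreate(
        name="WebBizz Misclassified Questions",
        seed_data=[
            SeedData(input_=r["question"], result=r["expected"]) for r in misclassified
        ],
    )
)

# Rephrase each hard question to produce extra training variants, which we
# then merge into the original train split. NOTE: the generation-type name
# is an assumption.
rephrased = okareo.generate_scenarios(
    source_scenario=hard_scenario.scenario_id,
    name="WebBizz Rephrased Hard Questions",
    number_examples=2,
    generation_type=ScenarioType.REPHRASE_INVARIANT,
)
```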

7. Repeat fine-tuning with the augmented split

This step repeats steps #3 and #4: we format the augmented train split as an instruction set and fine-tune Phi-3 on it.

8. Compare augmented LLM performance

After uploading the augmented Phi-3 model to Okareo, we can compare its classification metrics against the original fine-tuned model. We observe that the augmented model’s performance improves across all metrics for both splits!

Conclusion

In this post, we started by comparing prompt engineering and fine-tuning. Then we demonstrated how we can use Okareo’s synthetic data generators and classification evaluations to accelerate fine-tuning of an intent detection model for a RAG system. In a future post, we will showcase how Okareo can be used to help you fine-tune and evaluate other stages of your RAG pipeline, including retrieval and generation.
