How to Evaluate a Fine-tuned LLM
Evaluation

Matt Wyman, CEO and Co-Founder
Rachael Churchill, Technical Content Writer
January 13, 2025
Fine-tuning a large language model makes it more suited to a particular task or use case. When you fine-tune a model, you’re not retraining it from scratch — you’re creating a new model with updated weights based on a set of specialized training data that’s relatively small compared with the original training data. For example, the new training data may be specialist material on a technical topic, or it may be text written in a particular style you want to get your LLM to write in.
Because fine-tuning creates a new model with updated weights, it’s different from other approaches that work by prompting an existing, unchanged model. These other approaches can include prompt engineering, few-shot prompting (teaching the LLM to perform a specific task by giving it a few examples in the prompt), or retrieval-augmented generation (RAG), which works by supplying context in the prompt.
When should you use fine-tuning?
Fine-tuning is suitable for customizing an LLM to a particular dataset or to a particular genre or style of text. You can use it to make your LLM a specialist or expert on a particular topic, whether that’s ancient Chinese history, linear algebra, or the contents of your company’s product catalog.
You can also use fine-tuning to teach your model particular capabilities that the base model is not able to do (or not able to do reliably) but that you can demonstrate by example. This can include demonstrating by negative example, if the base model keeps doing something you want it to avoid.
Of course, it’s also possible to do this kind of teaching by example purely within the prompt, so you shouldn’t always jump straight into fine-tuning. It might be better to experiment with prompt engineering first, as that could solve the problem by itself, is simpler to do, and has a faster feedback loop. Also, fine-tuning permanently updates the model, which is what you want for some use cases, but not for others. If your model needs to stay general purpose and only apply specialist domain knowledge for some kinds of query, it’s probably better to supply it with that knowledge in the prompts for those queries rather than updating it permanently via fine-tuning.
However, if you are comfortable maintaining a custom version of the model in order to improve performance and accuracy, fine-tuning is the way to go. Fine-tuning can also improve on what you could do with few-shot learning via a prompt. For example, it lets you give the model more examples than can sensibly fit in a prompt. And because the specialist content is baked in during fine-tuning rather than added to the prompt at query time for every query, you save on token cost and latency.
It used to be that you could only fine-tune models you actually stored and controlled on your own servers, but now providers like OpenAI will allow you to fine-tune some of their hosted models. This lets you experiment with fine-tuning for a much lower cost of entry and ongoing cost.
Why and how to evaluate a fine-tuned LLM
After you’ve fine-tuned your model, you’ll want to know if it performs better than your original base model. Comparing two models requires evaluating each of them according to some metrics that you decide, which will depend on the goals of your LLM app.
You could evaluate how coherent or how friendly the response is, or how relevant to the input it is. These subjective kinds of metrics can be evaluated by a human or by another LLM acting as a judge, and they can be quantified by scoring them on a scale. Alternatively, you can apply binary deterministic metrics: for example, if your LLM outputs code, you could check whether that code contains all the necessary import statements (from a predefined list). Run the same set of checks on both the original base model and the fine-tuned model and do quantitative comparisons.
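To make the deterministic case concrete, here is a minimal sketch of such a check as an ordinary TypeScript function. It is purely illustrative: the list of required imports and the function name are assumptions, not part of Okareo’s API.
// Illustrative only: a binary, deterministic check that passes if the generated
// code contains every import statement from a predefined list.
const REQUIRED_IMPORTS = ["import numpy as np", "import pandas as pd"]; // assumed list

function hasRequiredImports(generatedCode: string): boolean {
    return REQUIRED_IMPORTS.every((imp) => generatedCode.includes(imp));
}

// Example: false, because the pandas import is missing from the output
console.log(hasRequiredImports("import numpy as np\nprint(np.zeros(3))"));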
Evaluating a fine-tuned LLM is an iterative process (although, because you’re updating the model, it’s slower and more complex than doing a similar iterative process with just refining the prompts). You can keep fine-tuning and re-evaluating, plot the changes or improvements in the metrics you care about, and identify where you run into diminishing returns and should stop.
Tutorial: How to evaluate a fine-tuned LLM using Okareo
The tutorial below shows you how to use Okareo to evaluate an LLM that’s been fine-tuned to answer in a particular style — specifically, that of Shakespeare. All the scripts are available on GitHub. You will need to download the Okareo CLI, which supports flow scripts written in Python or TypeScript. The tutorial below uses TypeScript.
1. Create example scenarios
First you need to create scenarios (questions and sample answers) that you want to evaluate both models (base and fine-tuned) with. You don’t need to create huge numbers of these. You can provide a handful of seed scenarios and have Okareo generate more.
Here’s an excerpt from an Okareo flow script showing part of an Okareo scenario set containing three simple questions that could be asked of an LLM.
The result field can be empty if you’re only going to apply reference-free metrics — ones that evaluate the style in which the responses are written, rather than comparing the responses with any gold-standard example answers.
const SCENARIO_SET_DATA = [
    SeedData({
        input: "How do I boil an egg?",
        result: ""
    }),
    SeedData({
        input: "Why is the sky blue?",
        result: ""
    }),
    SeedData({
        input: "What are the main anatomical features of a flower and what are they for?",
        result: ""
    })
];
You can register this scenario set with Okareo by adding this code to your flow script:
const scenario: any = await okareo.create_scenario_set(
    {
        name: `${SCENARIO_SET_NAME} Scenario Set - ${UNIQUE_BUILD_ID}`,
        project_id: project_id,
        seed_data: SCENARIO_SET_DATA
    }
);
2. Create custom checks
Okareo evaluates LLMs by applying checks. It includes some built-in checks, such as how coherent the output is. It also allows you to define your own custom checks in natural language, which it will then use to prompt another LLM acting as a judge to perform the evaluations.
The code below defines checks for how archaic and how poetic the language is (so that you can use them to evaluate the fine-tuned Shakespeare LLM against the base model).
// define new custom checks by prompting an LLM judge
const custom_checks: CHECK_TYPE[] = [
    {
        name: "custom.Archaic",
        description: "Use LLM-as-a-judge to rate how archaic the output is.",
        prompt: archaicPrompt,
        output_data_type: CheckOutputType.SCORE,
    },
    {
        name: "custom.Poetic",
        description: "Use LLM-as-a-judge to rate how poetic the output is.",
        prompt: poeticPrompt,
        output_data_type: CheckOutputType.SCORE,
    },
];
// register custom checks with Okareo
register_checks(okareo, project_id, custom_checks);
The prompt text in the archaicPrompt variable is as follows:
Output a score for how archaic the writing style is on a scale of 1 (most modern) to 5 (most archaic).
Model answer:
{generation}
Archaic Writing Style Score:
and poeticPrompt contains the following prompt text:
Output a score for how poetic the writing style is on a scale of 1 (most prosaic and ordinary) to 5 (most poetic). Poetic devices can include rhythm and meter, simile and metaphor, rhetorical questions, repetition, alliteration, elevated diction, and so on.
Model Answer:
{generation}
Poetic Writing Style Score:
Note the {generation} tags in the prompts. Okareo uses these to insert the output from the LLM under test into the prompt being supplied to the LLM judge.
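For completeness, these prompts can live in ordinary template-string constants in the flow script, so the custom check definitions above can reference them by name. This is simply the prompt text from above wrapped in TypeScript variables:
// Judge prompts referenced by the custom checks defined earlier
const archaicPrompt = `Output a score for how archaic the writing style is on a scale of 1 (most modern) to 5 (most archaic).
Model answer:
{generation}
Archaic Writing Style Score:`;

const poeticPrompt = `Output a score for how poetic the writing style is on a scale of 1 (most prosaic and ordinary) to 5 (most poetic). Poetic devices can include rhythm and meter, simile and metaphor, rhetorical questions, repetition, alliteration, elevated diction, and so on.
Model Answer:
{generation}
Poetic Writing Style Score:`;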
3. Register the base model with Okareo
Here’s how to register a model (in this case, OpenAI’s gpt-3.5-turbo-0125) with Okareo. USER_PROMPT_TEMPLATE refers to the input of each scenario in the scenario set defined in step 1 — this gets dynamically loaded later when the model is evaluated. UNIQUE_BUILD_ID and BASE_MODEL_NAME are also defined earlier in the flow script.
const USER_PROMPT_TEMPLATE = "{scenario_input}";
const model = await okareo.register_model({
    name: BASE_MODEL_NAME,
    tags: [`Build:${UNIQUE_BUILD_ID}`],
    project_id: project_id,
    models: {
        type: "openai",
        model_id: "gpt-3.5-turbo-0125",
        temperature: 0.5,
        user_prompt_template: USER_PROMPT_TEMPLATE,
    } as OpenAIModel,
    update: true,
});
4. Create fine-tuning data
If you don’t already have training data to fine-tune with, split the scenario set randomly into training data and test data, and express the training data in the form your chosen LLM platform requires for fine-tuning.
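If you do need to make your own split, a simple random shuffle followed by a holdout is usually enough. Here’s a minimal sketch (the 80/20 ratio is just an example, not something Okareo or OpenAI requires):
// Minimal sketch: shuffle the examples and hold out a fraction as test data
function trainTestSplit<T>(data: T[], testFraction = 0.2): { train: T[]; test: T[] } {
    const shuffled = [...data];
    // Fisher-Yates shuffle
    for (let i = shuffled.length - 1; i > 0; i--) {
        const j = Math.floor(Math.random() * (i + 1));
        [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
    }
    const testSize = Math.floor(shuffled.length * testFraction);
    return { test: shuffled.slice(0, testSize), train: shuffled.slice(testSize) };
}

const { train, test } = trainTestSplit(SCENARIO_SET_DATA);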
For example, OpenAI requires fine-tuning data as a set of examples in JSONL (JSON Lines) format, each containing one or more role fields with corresponding content fields. The assistant role is required, as it defines what the model should reply with. You can also optionally include a user role, which is the prompt from the user, and/or a system role, which is a system prompt.
{"messages":
[{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
{"role": "user", "content": "What's the capital of France?"},
{"role": "assistant","content": "Paris, as if everyone doesn't know that already."}]
}
This example is focused on fine-tuning the style of the model’s responses, rather than the content of its responses to particular questions, so only the assistant role is needed.
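In this tutorial the conversion is handled by the create_ft_file.py script (see the steps below), but the transformation itself is simple. Here is a rough TypeScript sketch of the same idea, producing one assistant-only example per JSONL line; the chunk and output file names are placeholders:
import fs from "fs";

// Rough sketch of what create_ft_file.py does: one assistant-only example per line.
// "chunks.txt" (blank-line-separated chunks) and "output.jsonl" are placeholder names.
const chunks = fs.readFileSync("chunks.txt", "utf8").split("\n\n").filter(Boolean);

const lines = chunks.map((chunk) =>
    JSON.stringify({ messages: [{ role: "assistant", content: chunk.trim() }] })
);

fs.writeFileSync("output.jsonl", lines.join("\n") + "\n");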
To recreate the JSONL used in this tutorial:
Download the complete works of Shakespeare from Project Gutenberg in plain text format.
Manually delete the front and end matter and divide the file into three sections: plays, sonnets, and other verse. These three files are provided on our GitHub.
Use the split_shakespeare.py script to split the content into lots of small chunks.
Use the create_ft_file.py script to convert these chunks to JSONL in the correct format for uploading to OpenAI (as in the “sarcastic chatbot” example above).
Use the validation script provided by OpenAI to validate the JSONL file.
5. Fine-tune the model
Fine-tune the model using the correctly formatted training data showing the style or capability you want (or the training data from the split in step 4). This is easier now that providers like OpenAI allow users to fine-tune hosted models. Full details on how to use OpenAI’s API to fine-tune one of their models are here, and there are more details about uploading files to OpenAI here.
First, upload your JSONL file (from the previous step) to OpenAI:
curl https://api.openai.com/v1/files \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-F purpose="fine-tune" \
-F file="@output.jsonl"
OPENAI_API_KEY is your OpenAI API key, set as an environment variable, and output.jsonl (note the preceding @ in the command) is the JSONL file you created in the previous step.
This will return a response including a file ID. Note it down.
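If you’d rather stay in TypeScript than use curl, the official openai npm package can perform the same upload. A minimal sketch, assuming the package is installed and OPENAI_API_KEY is set in your environment:
import fs from "fs";
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Upload the JSONL file; the id field of the response is the file ID to note down
const file = await client.files.create({
    file: fs.createReadStream("output.jsonl"),
    purpose: "fine-tune",
});
console.log(file.id);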
Next, kick off the fine-tuning job, where YOUR_FILE_ID is the file ID you noted down:
curl https://api.openai.com/v1/fine_tuning/jobs \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "training_file": "YOUR_FILE_ID",
    "model": "gpt-3.5-turbo-0125"
  }'
This will return a job ID. Note it down too.
The fine-tune will probably take a couple of hours to complete, but you can check progress using the following command, replacing YOUR_JOB_ID with the job ID you noted down:
curl https://api.openai.com/v1/fine_tuning/jobs/YOUR_JOB_ID/events \
-H "Authorization: Bearer $OPENAI_API_KEY"
6. Register the fine-tuned model with Okareo
Once the fine-tuning job has completed, OpenAI will supply you with a model ID for the fine-tuned model in an email. You can also look it up using the following command, which lists fine-tuning jobs that are in progress or recently completed:
curl https://api.openai.com/v1/fine_tuning/jobs?limit=2 \
-H "Authorization: Bearer $OPENAI_API_KEY"
You can then register that model with Okareo in the same way as you registered the base model:
const ftmodel = await okareo.register_model({
    name: FINE_TUNED_MODEL_NAME,
    tags: [`Build:${UNIQUE_BUILD_ID}`],
    project_id: project_id,
    models: {
        type: "openai",
        model_id: YOUR_FINE_TUNED_MODEL_ID,
        temperature: 0.5,
        user_prompt_template: USER_PROMPT_TEMPLATE,
    } as OpenAIModel,
    update: true,
});
Note that this code has three differences from the code used to register the base model: ftmodel, FINE_TUNED_MODEL_NAME, and YOUR_FINE_TUNED_MODEL_ID.
7. Evaluate the base model and the fine-tuned model
The following code will run the evaluation on the base model, and then on the fine-tuned model, using the same scenarios and the same checks:
const base_eval_run: components["schemas"]["TestRunItem"] = await model.run_test({
    model_api_key: OPENAI_API_KEY,
    name: `${BASE_MODEL_NAME} Eval ${UNIQUE_BUILD_ID}`,
    tags: [`Build:${UNIQUE_BUILD_ID}`],
    project_id: project_id,
    scenario: scenario,
    calculate_metrics: true,
    type: TestRunType.NL_GENERATION,
    checks: checks
} as RunTestProps);

const ft_eval_run: components["schemas"]["TestRunItem"] = await ftmodel.run_test({
    model_api_key: OPENAI_API_KEY,
    name: `${FINE_TUNED_MODEL_NAME} Eval ${UNIQUE_BUILD_ID}`,
    tags: [`Build:${UNIQUE_BUILD_ID}`],
    project_id: project_id,
    scenario: scenario,
    calculate_metrics: true,
    type: TestRunType.NL_GENERATION,
    checks: checks
} as RunTestProps);
8. Run the flow script
The Okareo flow script ft1.ts (available in our GitHub repo) includes the steps above: creating a scenario set, defining some checks, and registering the base model and the fine-tuned model, followed by the code to run the evaluation.
You can run this flow from your command line with this command:
okareo run -f ft1
Make sure you don’t have any other flows in your flows directory whose names contain the string ft1, because okareo run will run those too.
9. View and compare the results
To have Okareo output a link for viewing the results in the web app, add this code to your flow script:
// print a direct link to the evaluation report in Okareo (for convenience)
console.log(`See base results in Okareo: ${base_eval_run.app_link}`);
console.log(`See fine-tuned results in Okareo: ${ft_eval_run.app_link}`);
This allows you to compare the results from the base model and the fine-tuned model in an intuitive, visual way.
The Okareo web app has a “Score Cards” tab that allows you to view the results of multiple evaluations side by side and compare them at a glance, like this:

Here you can see that the Archaic and Poetic scores for the base model are consistently very low, but for the fine-tuned model they’re higher (and more variable).
If you’re fine-tuning multiple times in an iterative process, you could also plot a time-series graph of how the output improved over the course of that process (and see the point where the improvement levels off and runs into diminishing returns).
You can also look at the results of individual evaluations in the “Evaluations” tab of the web app and drill down to see the scores for each individual piece of text generated by the model.
And here’s an example of the output from each model so you can see the difference in style for yourself:
Base model:
The sky appears blue because of the way sunlight interacts with Earth's atmosphere. When sunlight reaches Earth, it is made up of a spectrum of colors, with shorter wavelengths like blue and violet being scattered more easily by the gases and particles in the atmosphere. This scattering causes the blue light to be more visible to our eyes, giving the sky its blue color.
Fine-tuned model:
Why is the sky blue? Why is the world
At endless strife with the spirit of man?
Why do the stars in their courses whirl,
And the planets and suns in their cycles plan
The doom of the earth and the fate of man?
Why is the sky blue
As you can see, the fine-tuned model’s output style differs significantly from the base model’s. Okareo provides a way to quantify this difference and apply the comparison in an automated and repeatable way.
Okareo provides versatile options for evaluating fine-tuned LLMs
This tutorial showed how to evaluate a fine-tuned LLM using Okareo. In this case, the LLM was fine-tuned to write in a particular style. For a different kind of example, see this article on using fine-tuning to improve an intent detection LLM and evaluating the difference.
Whatever your use case for fine-tuning, Okareo can help you evaluate the fine-tuned LLM and compare it against the base model. And if you decide to improve your product by refining your system prompt instead of fine-tuning, Okareo can evaluate that too. Sign up for Okareo today.
Fine-tuning a large language model makes it more suited to a particular task or use case. When you fine-tune a model, you’re not retraining it from scratch — you’re creating a new model with updated weights based on a set of specialized training data that’s relatively small compared with the original training data. For example, the new training data may be specialist material on a technical topic, or it may be text written in a particular style you want to get your LLM to write in.
Because fine-tuning creates a new model with updated weights, it’s different from other approaches that work by prompting an existing, unchanged model. These other approaches can include prompt engineering, few-shot prompting (teaching the LLM to perform a specific task by giving it a few examples in the prompt), or retrieval-augmented generation (RAG), which works by supplying context in the prompt.
When should you use fine-tuning?
Fine-tuning is suitable for customizing an LLM to a particular dataset or to a particular genre or style of text. You can use it to make your LLM a specialist or expert on a particular topic, whether that’s ancient Chinese history, linear algebra, or the contents of your company’s product catalog.
You can also use fine-tuning to teach your model particular capabilities that the base model is not able to do (or not able to do reliably) but that you can demonstrate by example. This can include demonstrating by negative example, if the base model keeps doing something you want it to avoid.
Of course, it’s also possible to do this kind of teaching by example purely within the prompt, so you shouldn’t always jump straight into fine-tuning. It might be better to experiment with prompt engineering first, as that could solve the problem by itself, is simpler to do, and has a faster feedback loop. Also, fine-tuning permanently updates the model, which is what you want for some use cases, but not for others. If your model needs to stay general purpose and only apply specialist domain knowledge for some kinds of query, it’s probably better to supply it with that knowledge in the prompts for those queries rather than updating it permanently via fine-tuning.
However, if you are comfortable having a custom version of the model in order to resolve performance and accuracy, fine-tuning is the way to go. Fine-tuning can also improve on what you could do with few-shot learning via a prompt. For example, it lets you give the model more examples than can sensibly fit in a prompt. And if the specialist content is used for fine-tuning in advance rather than added to the prompt at query time for every query, you save on token cost and latency.
It used to be that you could only fine-tune models you actually stored and controlled on your own servers, but now providers like OpenAI will allow you to fine-tune some of their hosted models. This lets you experiment with fine-tuning for a much lower cost of entry and ongoing cost.
Why and how to evaluate a fine-tuned LLM
After you’ve fine-tuned your model, you’ll want to know if it performs better than your original base model. Comparing two models requires evaluating each of them according to some metrics that you decide, which will depend on the goals of your LLM app.
You could evaluate how coherent or how friendly the response is, or how relevant to the input it is. These subjective kinds of metrics can be evaluated by a human or by another LLM acting as a judge, and they can be quantified by scoring them on a scale. Alternatively, you can apply binary deterministic metrics: for example, if your LLM outputs code, you could check whether that code contains all the necessary import statements (from a predefined list). Run the same set of checks on both the original base model and the fine-tuned model and do quantitative comparisons.
Evaluating a fine-tuned LLM is an iterative process (although, because you’re updating the model, it’s slower and more complex than doing a similar iterative process with just refining the prompts). You can keep fine-tuning and re-evaluating, plot the changes or improvements in the metrics you care about, and identify where you run into diminishing returns and should stop.
Tutorial: How to evaluate a fine-tuned LLM using Okareo
The tutorial below shows you how to use Okareo to evaluate an LLM that’s been fine-tuned to answer in a particular style — specifically, that of Shakespeare. All the scripts are available on GitHub. You will need to download the Okareo CLI, which supports flow scripts written in Python or TypeScript. The tutorial below uses TypeScript.
1. Create example scenarios
First you need to create scenarios (questions and sample answers) that you want to evaluate both models (base and fine-tuned) with. You don’t need to create huge numbers of these. You can provide a handful of seed scenarios and have Okareo generate more.
Here’s an excerpt from an Okareo flow script showing part of an Okareo scenario set containing three simple questions that could be asked of an LLM.
The result field can be empty if you’re only going to apply reference-free metrics — ones that evaluate the style in which the responses are written, rather than comparing the responses with any gold-standard example answers.
const SCENARIO_SET_DATA = [
SeedData({
input:"How do I boil an egg?",
result:""
}),
SeedData({
input:"Why is the sky blue?",
result:""
}),
SeedData({
input:"What are the main anatomical features of a flower and what are they for?",
result:""
})
];
You can register this scenario set with Okareo by adding this code to your flow script:
const scenario: any = await okareo.create_scenario_set(
{
name: `${SCENARIO_SET_NAME} Scenario Set - ${UNIQUE_BUILD_ID}`,
project_id: project_id,
seed_data: SCENARIO_SET_DATA
}
);
2. Create custom checks
Okareo evaluates LLMs by applying checks. It includes some built-in checks, such as how coherent the output is. It also allows you to define your own custom checks in natural language, which it will then use to prompt another LLM acting as a judge to perform the evaluations.
The code below defines checks for how archaic and how poetic the language is (so that you can use them to evaluate the fine-tuned Shakespeare LLM against the base model).
// define new custom checks by prompting an LLM judge
const custom_checks: CHECK_TYPE[] = [
{
name:"custom.Archaic",
description: "Use LLM-as-a-judge to rate how archaic the output is.",
prompt: archaicPrompt,
output_data_type: CheckOutputType.SCORE,
},
{
name:"custom.Poetic",
description: "Use LLM-as-a-judge to rate how poetic the output is.",
prompt: poeticPrompt,
output_data_type: CheckOutputType.SCORE,
},
];
// register custom checks with Okareo
register_checks(okareo, project_id, custom_checks);
The prompt text in the archaicPrompt
variable is as follows:
Output a score for how archaic the writing style is on a scale of 1 (most modern) to 5 (most archaic).
Model answer:
{generation}
Archaic Writing Style Score:
and poeticPrompt
contains the following prompt text:
Output a score for how poetic the writing style is on a scale of 1 (most prosaic and ordinary) to 5 (most poetic). Poetic devices can include rhythm and meter, simile and metaphor, rhetorical questions, repetition, alliteration, elevated diction, and so on.
Model Answer:
{generation}
Poetic Writing Style Score:
Note the {generation}
tags in the prompts. Okareo uses these to insert the output from the LLM under test into the prompt being supplied to the LLM judge.
3. Register the base model with Okareo
Here’s how to register a model (in this case, OpenAI’s gpt-3.5-turbo-0125
) with Okareo. USER_PROMPT_TEMPLATE
refers to the input of each scenario in the scenario set defined in step 1 — this gets dynamically loaded later when the model is evaluated. UNIQUE_BUILD_ID
and BASE_MODEL_NAME
are also defined earlier in the flow script.
const USER_PROMPT_TEMPLATE = "{scenario_input}";
const model = await okareo.register_model({
name:BASE_MODEL_NAME,
tags: [`Build:${UNIQUE_BUILD_ID}`],
project_id: project_id,
models: {
type: "openai",
model_id:"gpt-3.5-turbo-0125",
temperature:0.5,
user_prompt_template:USER_PROMPT_TEMPLATE,
} as OpenAIModel,
update: true,
});
4. Create fine-tuning data
If you don’t already have training data to fine-tune with, split the scenario set randomly into training data and test data, and express the training data in the form your chosen LLM platform requires for fine-tuning.
For example, OpenAI requires fine-tuning data as a set of examples in JSONL (JSON with newlines) that each contain one or more role fields with corresponding content
fields. The assistant
role is required, as it defines what the model should reply with. You can also optionally include a user
role, which is the prompt from the user, and/or a system
role, which is a system prompt.
{"messages":
[{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
{"role": "user", "content": "What's the capital of France?"},
{"role": "assistant","content": "Paris, as if everyone doesn't know that already."}]
}
This example is focused on fine-tuning the style of the model’s responses, rather than the content of its responses to particular questions, so only the assistant role is needed.
To recreate the JSONL used in this tutorial:
Download the complete works of Shakespeare from Project Gutenberg in plain text format.
Manually delete the front and end matter and divide the file into three sections: plays, sonnets, and other verse. These three files are provided on our GitHub.
Use the split_shakespeare.py script to split the content into lots of small chunks.
Use the create_ft_file.py script to convert these chunks to JSONL in the correct format for uploading to OpenAI (as in the “sarcastic chatbot” example above).
Use the validation script provided by OpenAI to validate the JSONL file.
5. Fine-tune the model
Fine-tune the model using the correctly-formatted training data showing the style or capability you want (or the training data from the split). This is easier now that providers like OpenAI allow users to fine-tune hosted models. Full details on how to use OpenAI’s API to fine-tune one of their models are here, and there are more details about uploading files to OpenAI here.
First, upload your JSONL file (from the previous step) to OpenAI:
curl https://api.openai.com/v1/files \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-F purpose="fine-tune" \
-F file="@output.jsonl"
OPENAI_API_KEY
is your OpenAI API key as an environment variable, and output.jsonl (note the preceding @) is the JSONL file you created in the previous step.
This will return a response including a file ID. Note it down.
Next, kick off the fine-tuning job, where YOUR_FILE_ID
is the file ID you noted down:
curl https://api.openai.com/v1/fine_tuning/jobs \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{
"training_file": YOUR_FILE_ID,
"model": "gpt-3.5-turbo-0125"
}'
This will return a job ID. Note it down too.
The fine-tune will probably take a couple of hours to complete, but you can check progress using the following command, replacing YOUR_JOB_ID
with the job ID you noted down:
curl https://api.openai.com/v1/fine_tuning/jobs/YOUR_JOB_ID/events \
-H "Authorization: Bearer $OPENAI_API_KEY"
6. Register the fine-tuned model with Okareo
Once the fine-tuning job has completed, OpenAI will supply you with a model ID for the fine-tuned model in an email. You can also look it up using the following command, which lists fine-tuning jobs that are in progress or recently completed:
curl https://api.openai.com/v1/fine_tuning/jobs?limit=2 \
-H "Authorization: Bearer $OPENAI_API_KEY"
You can then register that model with Okareo in the same way as you registered the base model:
const ftmodel = await okareo.register_model({
name: FINE_TUNED_MODEL_NAME,
tags: [`Build:${UNIQUE_BUILD_ID}`],
project_id: project_id,
models: {
type: "openai",
model_id:YOUR_FINE_TUNED_MODEL_ID,
temperature:0.5,
user_prompt_template:USER_PROMPT_TEMPLATE,
} as OpenAIModel,
update: true,
});
Note that this code has three differences from the code to register the base model: ftmodel, FINE_TUNED_MODEL_NAME
, and YOUR_FINE_TUNED_MODEL_ID
.
7. Evaluate the base model and the fine-tuned model
The following code will run the evaluation on the base model, and then on the fine-tuned model, using the same scenarios and the same checks:
const base_eval_run: components["schemas"]["TestRunItem"] = await model.run_test({
model_api_key: OPENAI_API_KEY,
name: `${BASE_MODEL_NAME} Eval ${UNIQUE_BUILD_ID}`,
tags: [`Build:${UNIQUE_BUILD_ID}`],
project_id: project_id,
scenario: scenario,
calculate_metrics: true,
type: TestRunType.NL_GENERATION,
checks: checks
} as RunTestProps);
const ft_eval_run: components["schemas"]["TestRunItem"] = await ftmodel.run_test({
model_api_key: OPENAI_API_KEY,
name: `${FINE_TUNED_MODEL_NAME} Eval ${UNIQUE_BUILD_ID}`,
tags: [`Build:${UNIQUE_BUILD_ID}`],
project_id: project_id,
scenario: scenario,
calculate_metrics: true,
type: TestRunType.NL_GENERATION,
checks: checks
} as RunTestProps);
8. Run the flow script
The Okareo flow script ft1.ts
(available in our GitHub repo) includes the steps above: creating a scenario set, defining some checks, and registering the base model and the fine-tuned model, followed by the code to run the evaluation.
You can run this flow from your command line with this command:
okareo run -f ft1
Make sure you don’t have any other flows in your flows
directory whose names contain the string ft1
, because okareo run
will run those too.
9. View and compare the results
To have Okareo output a link for viewing the results in the web app, add this code to your flow script:
// print a direct link to the evaluation report in Okareo (for convenience)
console.log(`See base results in Okareo: ${base_eval_run.app_link}`);
console.log(`See fine-tuned results in Okareo: ${ft_eval_run.app_link}`);
This allows you to compare the results from the base model and the fine-tuned model in an intuitive, visual way.
The Okareo web app has a “Score Cards” tab that allows you to view the results of multiple evaluations side by side and compare them at a glance, like this:

Here you can see that the Archaic and Poetic scores for the base model are consistently very low, but for the fine-tuned model they’re higher (and more variable).
If you’re fine-tuning multiple times in an iterative process, you could also plot a time-series graph of how the output improved over the course of that process (and see the point where the improvement levels off and runs into diminishing returns).
You can also look at the results of individual evaluations in the “Evaluations” tab of the web app and drill down to see the scores for each individual piece of text generated by the model.
And here’s an example of the output from each model so you can see the difference in style for yourself:
Base model:
The sky appears blue because of the way sunlight interacts with Earth's atmosphere. When sunlight reaches Earth, it is made up of a spectrum of colors, with shorter wavelengths like blue and violet being scattered more easily by the gases and particles in the atmosphere. This scattering causes the blue light to be more visible to our eyes, giving the sky its blue color.
Fine-tuned model:
Why is the sky blue? Why is the world
At endless strife with the spirit of man?
Why do the stars in their courses whirl,
And the planets and suns in their cycles plan
The doom of the earth and the fate of man?
Why is the sky blue
As you can see, the fine-tuned model’s output style differs significantly from the base model’s. Okareo provides a way to quantify this difference and apply the comparison in an automated and repeatable way.
Okareo provides versatile options for evaluating fine-tuned LLMs
This tutorial showed how to evaluate a fine-tuned LLM using Okareo. In this case, the LLM was fine-tuned to write in a particular style. For a different kind of example, see this article on using fine-tuning to improve an intent detection LLM and evaluating the difference.
Whatever your use case for fine-tuning, Okareo can help you evaluate the fine-tuned LLM and compare it against the base model. And if you decide to improve your product by refining your system prompt instead of fine-tuning, Okareo can evaluate that too. Sign up for Okareo today.
Fine-tuning a large language model makes it more suited to a particular task or use case. When you fine-tune a model, you’re not retraining it from scratch — you’re creating a new model with updated weights based on a set of specialized training data that’s relatively small compared with the original training data. For example, the new training data may be specialist material on a technical topic, or it may be text written in a particular style you want to get your LLM to write in.
Because fine-tuning creates a new model with updated weights, it’s different from other approaches that work by prompting an existing, unchanged model. These other approaches can include prompt engineering, few-shot prompting (teaching the LLM to perform a specific task by giving it a few examples in the prompt), or retrieval-augmented generation (RAG), which works by supplying context in the prompt.
When should you use fine-tuning?
Fine-tuning is suitable for customizing an LLM to a particular dataset or to a particular genre or style of text. You can use it to make your LLM a specialist or expert on a particular topic, whether that’s ancient Chinese history, linear algebra, or the contents of your company’s product catalog.
You can also use fine-tuning to teach your model particular capabilities that the base model is not able to do (or not able to do reliably) but that you can demonstrate by example. This can include demonstrating by negative example, if the base model keeps doing something you want it to avoid.
Of course, it’s also possible to do this kind of teaching by example purely within the prompt, so you shouldn’t always jump straight into fine-tuning. It might be better to experiment with prompt engineering first, as that could solve the problem by itself, is simpler to do, and has a faster feedback loop. Also, fine-tuning permanently updates the model, which is what you want for some use cases, but not for others. If your model needs to stay general purpose and only apply specialist domain knowledge for some kinds of query, it’s probably better to supply it with that knowledge in the prompts for those queries rather than updating it permanently via fine-tuning.
However, if you are comfortable having a custom version of the model in order to resolve performance and accuracy, fine-tuning is the way to go. Fine-tuning can also improve on what you could do with few-shot learning via a prompt. For example, it lets you give the model more examples than can sensibly fit in a prompt. And if the specialist content is used for fine-tuning in advance rather than added to the prompt at query time for every query, you save on token cost and latency.
It used to be that you could only fine-tune models you actually stored and controlled on your own servers, but now providers like OpenAI will allow you to fine-tune some of their hosted models. This lets you experiment with fine-tuning for a much lower cost of entry and ongoing cost.
Why and how to evaluate a fine-tuned LLM
After you’ve fine-tuned your model, you’ll want to know if it performs better than your original base model. Comparing two models requires evaluating each of them according to some metrics that you decide, which will depend on the goals of your LLM app.
You could evaluate how coherent or how friendly the response is, or how relevant to the input it is. These subjective kinds of metrics can be evaluated by a human or by another LLM acting as a judge, and they can be quantified by scoring them on a scale. Alternatively, you can apply binary deterministic metrics: for example, if your LLM outputs code, you could check whether that code contains all the necessary import statements (from a predefined list). Run the same set of checks on both the original base model and the fine-tuned model and do quantitative comparisons.
Evaluating a fine-tuned LLM is an iterative process (although, because you’re updating the model, it’s slower and more complex than doing a similar iterative process with just refining the prompts). You can keep fine-tuning and re-evaluating, plot the changes or improvements in the metrics you care about, and identify where you run into diminishing returns and should stop.
Tutorial: How to evaluate a fine-tuned LLM using Okareo
The tutorial below shows you how to use Okareo to evaluate an LLM that’s been fine-tuned to answer in a particular style — specifically, that of Shakespeare. All the scripts are available on GitHub. You will need to download the Okareo CLI, which supports flow scripts written in Python or TypeScript. The tutorial below uses TypeScript.
1. Create example scenarios
First you need to create scenarios (questions and sample answers) that you want to evaluate both models (base and fine-tuned) with. You don’t need to create huge numbers of these. You can provide a handful of seed scenarios and have Okareo generate more.
Here’s an excerpt from an Okareo flow script showing part of an Okareo scenario set containing three simple questions that could be asked of an LLM.
The result field can be empty if you’re only going to apply reference-free metrics — ones that evaluate the style in which the responses are written, rather than comparing the responses with any gold-standard example answers.
const SCENARIO_SET_DATA = [
SeedData({
input:"How do I boil an egg?",
result:""
}),
SeedData({
input:"Why is the sky blue?",
result:""
}),
SeedData({
input:"What are the main anatomical features of a flower and what are they for?",
result:""
})
];
You can register this scenario set with Okareo by adding this code to your flow script:
const scenario: any = await okareo.create_scenario_set(
{
name: `${SCENARIO_SET_NAME} Scenario Set - ${UNIQUE_BUILD_ID}`,
project_id: project_id,
seed_data: SCENARIO_SET_DATA
}
);
2. Create custom checks
Okareo evaluates LLMs by applying checks. It includes some built-in checks, such as how coherent the output is. It also allows you to define your own custom checks in natural language, which it will then use to prompt another LLM acting as a judge to perform the evaluations.
The code below defines checks for how archaic and how poetic the language is (so that you can use them to evaluate the fine-tuned Shakespeare LLM against the base model).
// define new custom checks by prompting an LLM judge
const custom_checks: CHECK_TYPE[] = [
{
name:"custom.Archaic",
description: "Use LLM-as-a-judge to rate how archaic the output is.",
prompt: archaicPrompt,
output_data_type: CheckOutputType.SCORE,
},
{
name:"custom.Poetic",
description: "Use LLM-as-a-judge to rate how poetic the output is.",
prompt: poeticPrompt,
output_data_type: CheckOutputType.SCORE,
},
];
// register custom checks with Okareo
register_checks(okareo, project_id, custom_checks);
The prompt text in the archaicPrompt
variable is as follows:
Output a score for how archaic the writing style is on a scale of 1 (most modern) to 5 (most archaic).
Model answer:
{generation}
Archaic Writing Style Score:
and poeticPrompt
contains the following prompt text:
Output a score for how poetic the writing style is on a scale of 1 (most prosaic and ordinary) to 5 (most poetic). Poetic devices can include rhythm and meter, simile and metaphor, rhetorical questions, repetition, alliteration, elevated diction, and so on.
Model Answer:
{generation}
Poetic Writing Style Score:
Note the {generation}
tags in the prompts. Okareo uses these to insert the output from the LLM under test into the prompt being supplied to the LLM judge.
3. Register the base model with Okareo
Here’s how to register a model (in this case, OpenAI’s gpt-3.5-turbo-0125
) with Okareo. USER_PROMPT_TEMPLATE
refers to the input of each scenario in the scenario set defined in step 1 — this gets dynamically loaded later when the model is evaluated. UNIQUE_BUILD_ID
and BASE_MODEL_NAME
are also defined earlier in the flow script.
const USER_PROMPT_TEMPLATE = "{scenario_input}";
const model = await okareo.register_model({
name:BASE_MODEL_NAME,
tags: [`Build:${UNIQUE_BUILD_ID}`],
project_id: project_id,
models: {
type: "openai",
model_id:"gpt-3.5-turbo-0125",
temperature:0.5,
user_prompt_template:USER_PROMPT_TEMPLATE,
} as OpenAIModel,
update: true,
});
4. Create fine-tuning data
If you don’t already have training data to fine-tune with, split the scenario set randomly into training data and test data, and express the training data in the form your chosen LLM platform requires for fine-tuning.
For example, OpenAI requires fine-tuning data as a set of examples in JSONL (JSON with newlines) that each contain one or more role fields with corresponding content
fields. The assistant
role is required, as it defines what the model should reply with. You can also optionally include a user
role, which is the prompt from the user, and/or a system
role, which is a system prompt.
{"messages":
[{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
{"role": "user", "content": "What's the capital of France?"},
{"role": "assistant","content": "Paris, as if everyone doesn't know that already."}]
}
This example is focused on fine-tuning the style of the model’s responses, rather than the content of its responses to particular questions, so only the assistant role is needed.
To recreate the JSONL used in this tutorial:
Download the complete works of Shakespeare from Project Gutenberg in plain text format.
Manually delete the front and end matter and divide the file into three sections: plays, sonnets, and other verse. These three files are provided on our GitHub.
Use the split_shakespeare.py script to split the content into lots of small chunks.
Use the create_ft_file.py script to convert these chunks to JSONL in the correct format for uploading to OpenAI (as in the “sarcastic chatbot” example above).
Use the validation script provided by OpenAI to validate the JSONL file.
5. Fine-tune the model
Fine-tune the model using the correctly-formatted training data showing the style or capability you want (or the training data from the split). This is easier now that providers like OpenAI allow users to fine-tune hosted models. Full details on how to use OpenAI’s API to fine-tune one of their models are here, and there are more details about uploading files to OpenAI here.
First, upload your JSONL file (from the previous step) to OpenAI:
curl https://api.openai.com/v1/files \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-F purpose="fine-tune" \
-F file="@output.jsonl"
OPENAI_API_KEY
is your OpenAI API key as an environment variable, and output.jsonl (note the preceding @) is the JSONL file you created in the previous step.
This will return a response including a file ID. Note it down.
Next, kick off the fine-tuning job, where YOUR_FILE_ID
is the file ID you noted down:
curl https://api.openai.com/v1/fine_tuning/jobs \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{
"training_file": YOUR_FILE_ID,
"model": "gpt-3.5-turbo-0125"
}'
This will return a job ID. Note it down too.
The fine-tune will probably take a couple of hours to complete, but you can check progress using the following command, replacing YOUR_JOB_ID
with the job ID you noted down:
curl https://api.openai.com/v1/fine_tuning/jobs/YOUR_JOB_ID/events \
-H "Authorization: Bearer $OPENAI_API_KEY"
6. Register the fine-tuned model with Okareo
Once the fine-tuning job has completed, OpenAI will supply you with a model ID for the fine-tuned model in an email. You can also look it up using the following command, which lists fine-tuning jobs that are in progress or recently completed:
curl https://api.openai.com/v1/fine_tuning/jobs?limit=2 \
-H "Authorization: Bearer $OPENAI_API_KEY"
You can then register that model with Okareo in the same way as you registered the base model:
const ftmodel = await okareo.register_model({
name: FINE_TUNED_MODEL_NAME,
tags: [`Build:${UNIQUE_BUILD_ID}`],
project_id: project_id,
models: {
type: "openai",
model_id:YOUR_FINE_TUNED_MODEL_ID,
temperature:0.5,
user_prompt_template:USER_PROMPT_TEMPLATE,
} as OpenAIModel,
update: true,
});
Note that this code has three differences from the code to register the base model: ftmodel, FINE_TUNED_MODEL_NAME
, and YOUR_FINE_TUNED_MODEL_ID
.
7. Evaluate the base model and the fine-tuned model
The following code will run the evaluation on the base model, and then on the fine-tuned model, using the same scenarios and the same checks:
const base_eval_run: components["schemas"]["TestRunItem"] = await model.run_test({
model_api_key: OPENAI_API_KEY,
name: `${BASE_MODEL_NAME} Eval ${UNIQUE_BUILD_ID}`,
tags: [`Build:${UNIQUE_BUILD_ID}`],
project_id: project_id,
scenario: scenario,
calculate_metrics: true,
type: TestRunType.NL_GENERATION,
checks: checks
} as RunTestProps);
const ft_eval_run: components["schemas"]["TestRunItem"] = await ftmodel.run_test({
model_api_key: OPENAI_API_KEY,
name: `${FINE_TUNED_MODEL_NAME} Eval ${UNIQUE_BUILD_ID}`,
tags: [`Build:${UNIQUE_BUILD_ID}`],
project_id: project_id,
scenario: scenario,
calculate_metrics: true,
type: TestRunType.NL_GENERATION,
checks: checks
} as RunTestProps);
8. Run the flow script
The Okareo flow script ft1.ts
(available in our GitHub repo) includes the steps above: creating a scenario set, defining some checks, and registering the base model and the fine-tuned model, followed by the code to run the evaluation.
You can run this flow from your command line with this command:
okareo run -f ft1
Make sure you don’t have any other flows in your flows
directory whose names contain the string ft1
, because okareo run
will run those too.
9. View and compare the results
To have Okareo output a link for viewing the results in the web app, add this code to your flow script:
// print a direct link to the evaluation report in Okareo (for convenience)
console.log(`See base results in Okareo: ${base_eval_run.app_link}`);
console.log(`See fine-tuned results in Okareo: ${ft_eval_run.app_link}`);
This allows you to compare the results from the base model and the fine-tuned model in an intuitive, visual way.
The Okareo web app has a “Score Cards” tab that allows you to view the results of multiple evaluations side by side and compare them at a glance, like this:

Here you can see that the Archaic and Poetic scores for the base model are consistently very low, but for the fine-tuned model they’re higher (and more variable).
If you’re fine-tuning multiple times in an iterative process, you could also plot a time-series graph of how the output improved over the course of that process (and see the point where the improvement levels off and runs into diminishing returns).
You can also look at the results of individual evaluations in the “Evaluations” tab of the web app and drill down to see the scores for each individual piece of text generated by the model.
And here’s an example of the output from each model so you can see the difference in style for yourself:
Base model:
The sky appears blue because of the way sunlight interacts with Earth's atmosphere. When sunlight reaches Earth, it is made up of a spectrum of colors, with shorter wavelengths like blue and violet being scattered more easily by the gases and particles in the atmosphere. This scattering causes the blue light to be more visible to our eyes, giving the sky its blue color.
Fine-tuned model:
Why is the sky blue? Why is the world
At endless strife with the spirit of man?
Why do the stars in their courses whirl,
And the planets and suns in their cycles plan
The doom of the earth and the fate of man?
Why is the sky blue
As you can see, the fine-tuned model’s output style differs significantly from the base model’s. Okareo provides a way to quantify this difference and apply the comparison in an automated and repeatable way.
Okareo provides versatile options for evaluating fine-tuned LLMs
This tutorial showed how to evaluate a fine-tuned LLM using Okareo. In this case, the LLM was fine-tuned to write in a particular style. For a different kind of example, see this article on using fine-tuning to improve an intent detection LLM and evaluating the difference.
Whatever your use case for fine-tuning, Okareo can help you evaluate the fine-tuned LLM and compare it against the base model. And if you decide to improve your product by refining your system prompt instead of fine-tuning, Okareo can evaluate that too. Sign up for Okareo today.
Fine-tuning a large language model makes it more suited to a particular task or use case. When you fine-tune a model, you’re not retraining it from scratch — you’re creating a new model with updated weights based on a set of specialized training data that’s relatively small compared with the original training data. For example, the new training data may be specialist material on a technical topic, or it may be text written in a particular style you want to get your LLM to write in.
Because fine-tuning creates a new model with updated weights, it’s different from other approaches that work by prompting an existing, unchanged model. These other approaches can include prompt engineering, few-shot prompting (teaching the LLM to perform a specific task by giving it a few examples in the prompt), or retrieval-augmented generation (RAG), which works by supplying context in the prompt.
When should you use fine-tuning?
Fine-tuning is suitable for customizing an LLM to a particular dataset or to a particular genre or style of text. You can use it to make your LLM a specialist or expert on a particular topic, whether that’s ancient Chinese history, linear algebra, or the contents of your company’s product catalog.
You can also use fine-tuning to teach your model particular capabilities that the base model is not able to do (or not able to do reliably) but that you can demonstrate by example. This can include demonstrating by negative example, if the base model keeps doing something you want it to avoid.
Of course, it’s also possible to do this kind of teaching by example purely within the prompt, so you shouldn’t always jump straight into fine-tuning. It might be better to experiment with prompt engineering first, as that could solve the problem by itself, is simpler to do, and has a faster feedback loop. Also, fine-tuning permanently updates the model, which is what you want for some use cases, but not for others. If your model needs to stay general purpose and only apply specialist domain knowledge for some kinds of query, it’s probably better to supply it with that knowledge in the prompts for those queries rather than updating it permanently via fine-tuning.
However, if you are comfortable having a custom version of the model in order to resolve performance and accuracy, fine-tuning is the way to go. Fine-tuning can also improve on what you could do with few-shot learning via a prompt. For example, it lets you give the model more examples than can sensibly fit in a prompt. And if the specialist content is used for fine-tuning in advance rather than added to the prompt at query time for every query, you save on token cost and latency.
It used to be that you could only fine-tune models you actually stored and controlled on your own servers, but now providers like OpenAI will allow you to fine-tune some of their hosted models. This lets you experiment with fine-tuning for a much lower cost of entry and ongoing cost.
Why and how to evaluate a fine-tuned LLM
After you’ve fine-tuned your model, you’ll want to know if it performs better than your original base model. Comparing two models requires evaluating each of them according to some metrics that you decide, which will depend on the goals of your LLM app.
You could evaluate how coherent or how friendly the response is, or how relevant to the input it is. These subjective kinds of metrics can be evaluated by a human or by another LLM acting as a judge, and they can be quantified by scoring them on a scale. Alternatively, you can apply binary deterministic metrics: for example, if your LLM outputs code, you could check whether that code contains all the necessary import statements (from a predefined list). Run the same set of checks on both the original base model and the fine-tuned model and do quantitative comparisons.
Evaluating a fine-tuned LLM is an iterative process (although, because you’re updating the model, it’s slower and more complex than doing a similar iterative process with just refining the prompts). You can keep fine-tuning and re-evaluating, plot the changes or improvements in the metrics you care about, and identify where you run into diminishing returns and should stop.
Tutorial: How to evaluate a fine-tuned LLM using Okareo
The tutorial below shows you how to use Okareo to evaluate an LLM that’s been fine-tuned to answer in a particular style — specifically, that of Shakespeare. All the scripts are available on GitHub. You will need to download the Okareo CLI, which supports flow scripts written in Python or TypeScript. The tutorial below uses TypeScript.
1. Create example scenarios
First you need to create scenarios (questions and sample answers) that you want to evaluate both models (base and fine-tuned) with. You don’t need to create huge numbers of these. You can provide a handful of seed scenarios and have Okareo generate more.
Here’s an excerpt from an Okareo flow script showing part of an Okareo scenario set containing three simple questions that could be asked of an LLM.
The result field can be empty if you’re only going to apply reference-free metrics — ones that evaluate the style in which the responses are written, rather than comparing the responses with any gold-standard example answers.
const SCENARIO_SET_DATA = [
SeedData({
input:"How do I boil an egg?",
result:""
}),
SeedData({
input:"Why is the sky blue?",
result:""
}),
SeedData({
input:"What are the main anatomical features of a flower and what are they for?",
result:""
})
];
You can register this scenario set with Okareo by adding this code to your flow script:
const scenario: any = await okareo.create_scenario_set(
{
name: `${SCENARIO_SET_NAME} Scenario Set - ${UNIQUE_BUILD_ID}`,
project_id: project_id,
seed_data: SCENARIO_SET_DATA
}
);
2. Create custom checks
Okareo evaluates LLMs by applying checks. It includes some built-in checks, such as how coherent the output is. It also allows you to define your own custom checks in natural language, which it will then use to prompt another LLM acting as a judge to perform the evaluations.
The code below defines checks for how archaic and how poetic the language is (so that you can use them to evaluate the fine-tuned Shakespeare LLM against the base model).
// define new custom checks by prompting an LLM judge
const custom_checks: CHECK_TYPE[] = [
{
name:"custom.Archaic",
description: "Use LLM-as-a-judge to rate how archaic the output is.",
prompt: archaicPrompt,
output_data_type: CheckOutputType.SCORE,
},
{
name:"custom.Poetic",
description: "Use LLM-as-a-judge to rate how poetic the output is.",
prompt: poeticPrompt,
output_data_type: CheckOutputType.SCORE,
},
];
// register custom checks with Okareo
register_checks(okareo, project_id, custom_checks);
The prompt text in the archaicPrompt
variable is as follows:
Output a score for how archaic the writing style is on a scale of 1 (most modern) to 5 (most archaic).
Model answer:
{generation}
Archaic Writing Style Score:
and poeticPrompt
contains the following prompt text:
Output a score for how poetic the writing style is on a scale of 1 (most prosaic and ordinary) to 5 (most poetic). Poetic devices can include rhythm and meter, simile and metaphor, rhetorical questions, repetition, alliteration, elevated diction, and so on.
Model Answer:
{generation}
Poetic Writing Style Score:
Note the {generation}
tags in the prompts. Okareo uses these to insert the output from the LLM under test into the prompt being supplied to the LLM judge.
3. Register the base model with Okareo
Here's how to register a model (in this case, OpenAI's gpt-3.5-turbo-0125) with Okareo. USER_PROMPT_TEMPLATE refers to the input of each scenario in the scenario set defined in step 1; it gets dynamically loaded later when the model is evaluated. UNIQUE_BUILD_ID and BASE_MODEL_NAME are also defined earlier in the flow script.
const USER_PROMPT_TEMPLATE = "{scenario_input}";

const model = await okareo.register_model({
    name: BASE_MODEL_NAME,
    tags: [`Build:${UNIQUE_BUILD_ID}`],
    project_id: project_id,
    models: {
        type: "openai",
        model_id: "gpt-3.5-turbo-0125",
        temperature: 0.5,
        user_prompt_template: USER_PROMPT_TEMPLATE,
    } as OpenAIModel,
    update: true,
});
4. Create fine-tuning data
If you don’t already have training data to fine-tune with, split the scenario set randomly into training data and test data, and express the training data in the form your chosen LLM platform requires for fine-tuning.
For example, OpenAI requires fine-tuning data as a set of examples in JSONL (JSON Lines) format that each contain one or more role fields with corresponding content fields. The assistant role is required, as it defines what the model should reply with. You can also optionally include a user role, which is the prompt from the user, and/or a system role, which is a system prompt.
{"messages": [
    {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}
]}
This tutorial's example is focused on fine-tuning the style of the model's responses, rather than the content of its responses to particular questions, so only the assistant role is needed.
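For illustration, a single line of the Shakespeare training file would take roughly this form, with one assistant message per example (the content shown here is just a stand-in for the chunks produced by the scripts below):

{"messages": [{"role": "assistant", "content": "Shall I compare thee to a summer's day? Thou art more lovely and more temperate."}]}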
To recreate the JSONL used in this tutorial:
Download the complete works of Shakespeare from Project Gutenberg in plain text format.
Manually delete the front and end matter and divide the file into three sections: plays, sonnets, and other verse. These three files are provided on our GitHub.
Use the split_shakespeare.py script to split the content into lots of small chunks.
Use the create_ft_file.py script to convert these chunks to JSONL in the correct format for uploading to OpenAI (as in the "sarcastic chatbot" example above); a rough sketch of this conversion step follows this list.
Use the validation script provided by OpenAI to validate the JSONL file.
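The split_shakespeare.py and create_ft_file.py scripts in the repo are written in Python, but as a rough illustration of what the conversion step does, here is a minimal TypeScript sketch that turns a directory of text chunks into assistant-only JSONL. The chunks directory name and file layout here are assumptions for illustration, not the repo's actual structure.

import * as fs from "fs";
import * as path from "path";

// Read every plain-text chunk from a directory (hypothetical layout) and write
// one JSONL training example per chunk, using only the "assistant" role.
const chunksDir = "chunks";        // hypothetical directory of text chunks
const outputFile = "output.jsonl";

const lines = fs.readdirSync(chunksDir)
    .filter((f) => f.endsWith(".txt"))
    .map((f) => fs.readFileSync(path.join(chunksDir, f), "utf8").trim())
    .filter((text) => text.length > 0)
    .map((text) => JSON.stringify({ messages: [{ role: "assistant", content: text }] }));

fs.writeFileSync(outputFile, lines.join("\n") + "\n");

Each line of the resulting output.jsonl is then one self-contained training example in the format shown above.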
5. Fine-tune the model
Fine-tune the model using the correctly formatted training data that demonstrates the style or capability you want (or the training data from the split). This is easier now that providers like OpenAI allow users to fine-tune hosted models. Full details on how to use OpenAI's API to fine-tune one of their models are here, and there are more details about uploading files to OpenAI here.
First, upload your JSONL file (from the previous step) to OpenAI:
curl https://api.openai.com/v1/files \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-F purpose="fine-tune" \
-F file="@output.jsonl"
OPENAI_API_KEY is your OpenAI API key, supplied as an environment variable, and output.jsonl is the JSONL file you created in the previous step (the preceding @ tells curl to upload the contents of the file).
This will return a response including a file ID. Note it down.
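The response is a JSON description of the uploaded file, and its id field is the file ID to note down. It looks roughly like this (the values shown are illustrative):

{
  "id": "file-abc123",
  "object": "file",
  "bytes": 1234567,
  "created_at": 1700000000,
  "filename": "output.jsonl",
  "purpose": "fine-tune"
}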
Next, kick off the fine-tuning job, where YOUR_FILE_ID is the file ID you noted down:
curl https://api.openai.com/v1/fine_tuning/jobs \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "training_file": "YOUR_FILE_ID",
    "model": "gpt-3.5-turbo-0125"
  }'
This will return a job ID. Note it down too.
The fine-tuning job will probably take a couple of hours to complete, but you can check progress using the following command, replacing YOUR_JOB_ID with the job ID you noted down:
curl https://api.openai.com/v1/fine_tuning/jobs/YOUR_JOB_ID/events \
-H "Authorization: Bearer $OPENAI_API_KEY"
6. Register the fine-tuned model with Okareo
Once the fine-tuning job has completed, OpenAI will supply you with a model ID for the fine-tuned model in an email. You can also look it up using the following command, which lists fine-tuning jobs that are in progress or recently completed:
curl https://api.openai.com/v1/fine_tuning/jobs?limit=2 \
-H "Authorization: Bearer $OPENAI_API_KEY"
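The model ID you need appears in the fine_tuned_model field of the completed job. The listing response looks roughly like this (truncated, with illustrative values):

{
  "object": "list",
  "data": [
    {
      "id": "ftjob-abc123",
      "object": "fine_tuning.job",
      "model": "gpt-3.5-turbo-0125",
      "status": "succeeded",
      "fine_tuned_model": "ft:gpt-3.5-turbo-0125:your-org::abc123"
    }
  ],
  "has_more": false
}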
You can then register that model with Okareo in the same way as you registered the base model:
const ftmodel = await okareo.register_model({
    name: FINE_TUNED_MODEL_NAME,
    tags: [`Build:${UNIQUE_BUILD_ID}`],
    project_id: project_id,
    models: {
        type: "openai",
        model_id: YOUR_FINE_TUNED_MODEL_ID,
        temperature: 0.5,
        user_prompt_template: USER_PROMPT_TEMPLATE,
    } as OpenAIModel,
    update: true,
});
Note that this code differs from the base model registration in three places: the variable name (ftmodel), the model name (FINE_TUNED_MODEL_NAME), and the model ID (YOUR_FINE_TUNED_MODEL_ID).
7. Evaluate the base model and the fine-tuned model
The following code will run the evaluation on the base model, and then on the fine-tuned model, using the same scenarios and the same checks:
const base_eval_run: components["schemas"]["TestRunItem"] = await model.run_test({
    model_api_key: OPENAI_API_KEY,
    name: `${BASE_MODEL_NAME} Eval ${UNIQUE_BUILD_ID}`,
    tags: [`Build:${UNIQUE_BUILD_ID}`],
    project_id: project_id,
    scenario: scenario,
    calculate_metrics: true,
    type: TestRunType.NL_GENERATION,
    checks: checks
} as RunTestProps);

const ft_eval_run: components["schemas"]["TestRunItem"] = await ftmodel.run_test({
    model_api_key: OPENAI_API_KEY,
    name: `${FINE_TUNED_MODEL_NAME} Eval ${UNIQUE_BUILD_ID}`,
    tags: [`Build:${UNIQUE_BUILD_ID}`],
    project_id: project_id,
    scenario: scenario,
    calculate_metrics: true,
    type: TestRunType.NL_GENERATION,
    checks: checks
} as RunTestProps);
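Both run_test calls reference a checks array that's defined earlier in the full flow script. A minimal version, assuming you only apply the two custom checks registered in step 2, might look like this (the full script on GitHub may also include some of Okareo's built-in checks):

const checks = [
    "custom.Archaic",
    "custom.Poetic",
];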
8. Run the flow script
The Okareo flow script ft1.ts (available in our GitHub repo) includes the steps above: creating a scenario set, defining some checks, and registering the base model and the fine-tuned model, followed by the code to run the evaluation.
You can run this flow from your command line with this command:
okareo run -f ft1
Make sure you don't have any other flows in your flows directory whose names contain the string ft1, because okareo run will run those too.
9. View and compare the results
To have Okareo output a link for viewing the results in the web app, add this code to your flow script:
// print a direct link to the evaluation report in Okareo (for convenience)
console.log(`See base results in Okareo: ${base_eval_run.app_link}`);
console.log(`See fine-tuned results in Okareo: ${ft_eval_run.app_link}`);
This allows you to compare the results from the base model and the fine-tuned model in an intuitive, visual way.
The Okareo web app has a "Score Cards" tab that allows you to view the results of multiple evaluations side by side and compare them at a glance, like this:
[Screenshot: Score Cards view comparing the base model and fine-tuned model evaluations]
Here you can see that the Archaic and Poetic scores for the base model are consistently very low, but for the fine-tuned model they’re higher (and more variable).
If you’re fine-tuning multiple times in an iterative process, you could also plot a time-series graph of how the output improved over the course of that process (and see the point where the improvement levels off and runs into diminishing returns).
You can also look at the results of individual evaluations in the “Evaluations” tab of the web app and drill down to see the scores for each individual piece of text generated by the model.
And here’s an example of the output from each model so you can see the difference in style for yourself:
Base model:
The sky appears blue because of the way sunlight interacts with Earth's atmosphere. When sunlight reaches Earth, it is made up of a spectrum of colors, with shorter wavelengths like blue and violet being scattered more easily by the gases and particles in the atmosphere. This scattering causes the blue light to be more visible to our eyes, giving the sky its blue color.
Fine-tuned model:
Why is the sky blue? Why is the world
At endless strife with the spirit of man?
Why do the stars in their courses whirl,
And the planets and suns in their cycles plan
The doom of the earth and the fate of man?
Why is the sky blue
As you can see, the fine-tuned model’s output style differs significantly from the base model’s. Okareo provides a way to quantify this difference and apply the comparison in an automated and repeatable way.
Okareo provides versatile options for evaluating fine-tuned LLMs
This tutorial showed how to evaluate a fine-tuned LLM using Okareo. In this case, the LLM was fine-tuned to write in a particular style. For a different kind of example, see this article on using fine-tuning to improve an intent detection LLM and evaluating the difference.
Whatever your use case for fine-tuning, Okareo can help you evaluate the fine-tuned LLM and compare it against the base model. And if you decide to improve your product by refining your system prompt instead of fine-tuning, Okareo can evaluate that too. Sign up for Okareo today.