Summarization Evaluation

Evaluation

Matt Wyman, Co-founder of Okareo

May 6, 2024

Summarization has become a mainstay of LLM usage. In this blog we'll use a meeting transcript summarization example. You can get to a working solution in an AI playground fairly quickly, but add the inevitable need to pull out metadata, active speakers, and decisions, and you rapidly head down the path of language-coding with LLMs. This gets challenging quickly. Let's explore a few techniques to improve development and regression testing of transcript summarization prompts.

Transcripts bring a unique set of challenges. Written documents have an author and a structure; the same is not true of transcripts. Most transcripts have multiple speakers, frequent side-chatter, and few, if any, clear topical signposts. This is not to say that document summarization as part of a RAG pipeline doesn't come with challenges, but we will leave RAG summarization to a future blog.

Transcript Summarization

INFO: In this blog we will build the evaluation loop for a project summarizing city council meetings. The source content we will use is the public-domain "MeetingBank" collection of city council transcripts hosted on Zenodo.

Most transcript summarization pipelines seek to:

  • Create structure from the transcript (tasks, agenda, decisions, etc.)

  • Identify key events and timing

  • Identify the participants

  • Produce per-topic summaries

Using a small excerpt from a Seattle City Council meeting, we can easily see multiple people speaking, decisions being made, and a naturally wandering structure. Thankfully, LLMs do a reasonably good job of recognizing what was intended versus what was actually said.

…report with Councilman Mr. Brian Johnson. So want and what is in favor. And councilmembers Herbold and Burgess opposed. Councilmember O'Brien. Thank you. So, I mean, just give a quick overview of the legislation before us. We've had a lot of discussion…

Let's construct a prompt in three parts:

  • the persona we want used

  • the directive we want the LLM to follow

  • the structure of the response we are looking for

Example Prompt
`You are a City Manager with significant AI/LLM skills.
You are tasked with summarizing the key points from a
meeting and responding in a structured manner. You have
a strong understanding of the meeting's context and the
attendees. You also follow rules very closely.

Provide a summary of the meeting in under 50 words.
Your response MUST be in the following JSON format.
Content you add should not have special characters 
or line breaks.

{
  "actions": LIST_OF_ACTION_ITEMS_FROM_THE_MEETING,
  "short_summary": SUMMARY_OF_MEETING_IN_UNDER_50_WORDS,
  "attendee_list": LIST_OF_ATTENDEES
}`
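
Before building any evaluation, it is worth a quick sanity check that the prompt produces parseable output at all. Here is a minimal sketch using OpenAI's Python client; the model name, file paths, and variable names are placeholders for whatever you are testing against, not part of the Okareo SDK:

Python:
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The persona + directive + response-structure prompt shown above.
SUMMARY_PROMPT = open("prompts/meeting_summary.txt").read()    # hypothetical path
transcript = open("example_data/seattle_excerpt.txt").read()   # hypothetical path

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # any capable chat model works for a first pass
    messages=[
        {"role": "system", "content": SUMMARY_PROMPT},
        {"role": "user", "content": transcript},
    ],
)

# If the model followed the directive, this parses cleanly into the expected keys.
summary = json.loads(response.choices[0].message.content)
print(summary["short_summary"], summary["attendee_list"], summary["actions"])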

Evaluation

To make sure that the LLM works consistently, we need to evaluate the outcome across a few dimensions:

  1. Set up metrics and assertions on the output

  2. Expand the inputs to avoid building a prompt that only works for one input

  3. Add an evaluation flow that we can run locally or add to CI

Each of the items above will get its own deep dive. For this article, we will discuss how each of these is assembled into an evaluation flow.

Metrics

At Okareo, we call each measurement a check. For model unit testing locally or in CI, we suggest checks that are narrowly defined and can be executed on each trio of input, expected, and actual result. After establishing these rapid checks, you will have a baseline for building and improving your use of the LLM.

To make sure our summarization output is working, let's establish the following checks per scenario:

/*
1. Library - summary_relevance : Score the relevance of the summary to the original content.
2. Library - summary_consistency : Score the writing consistency of the short_summary.
3. Custom - summary.isJSON : Pass if the output is in JSON format with attendee_list, actions, and short_summary properties.
4. Custom - attendees.length : Count the number of attendees from the JSON response.
5. Custom - actions.length : Count the number of action items from the JSON response.
6. Custom - summary.length : Provide the length of the short_summary so we can see the distribution.
7. Custom - summary.Under256 : Pass if the short_summary is under 256 characters.
*/

The first two checks are standard LLM checks that you can use from the Okareo checks library. The remainder are unique to this model and our expectations. You can create any of the above checks in the Okareo app by clicking "Create Check" or through the SDK:

TypeScript:
const isJSON_check = await okareo.generate_check({
  project_id,
  name: "summary.isJSON",
  description: "Pass if output is in JSON format with attendee_list, actions, and short_summary properties.",
  output_data_type: "bool", // "int" and "float" are also allowed
  requires_scenario_input: false,
  requires_scenario_result: true,
});
return await okareo.upload_check({
  project_id,
  ...isJSON_check
} as UploadEvaluatorProps);
Python:
import os
import tempfile

from okareo_api_client.models.evaluator_spec_request import EvaluatorSpecRequest

description = """Pass if output is in JSON format with attendee_list, actions, and short_summary properties."""

generate_request = EvaluatorSpecRequest(
  description=description,
  requires_scenario_input=False,
  requires_scenario_result=True,
  output_data_type="bool"
)
generated_test = okareo.generate_check(generate_request).generated_code

# Write the generated check to a temp file so it can be uploaded.
check_name = "summary.isJSON"
temp_dir = tempfile.gettempdir()
file_path = os.path.join(temp_dir, f"{check_name}.py")
with open(file_path, "w+") as file:
  file.write(generated_test)

is_json_check = okareo.upload_check(
  name=check_name,
  file_path=file_path,
  requires_scenario_input=False,
  requires_scenario_result=True
)
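
For reference, the pass/fail logic behind these custom checks is deliberately simple. Here is a standalone sketch of what summary.isJSON and summary.Under256 boil down to; this is illustrative only, and the code Okareo actually generates may be structured differently:

Python:
import json

def summary_is_json(model_output: str) -> bool:
    # Pass if the output parses as JSON and contains the three expected properties.
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return all(key in parsed for key in ("attendee_list", "actions", "short_summary"))

def summary_under_256(model_output: str) -> bool:
    # Pass if the short_summary stays under 256 characters.
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return len(parsed.get("short_summary", "")) < 256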


Scenarios

There are numerous ways to set up scenarios. In this case, since we have access to meeting transcripts in the public domain, we will simply upload a range of meetings for use in our evaluation.

TypeScript:
import { Okareo } from "okareo-ts-sdk";
const okareo = new Okareo({api_key:OKAREO_API_KEY});
const data: any = await okareo.upload_scenario_set({
  file_path: "example_data/seed_data.jsonl",
  scenario_name: "Uploaded Scenario Set",
  project_id: project_id
});

Python:
from okareo import Okareo
okareo = Okareo("YOUR_API_TOKEN")
okareo.upload_scenario_set(
  file_path="example_data/seed_data.jsonl",
  scenario_name="Uploaded Scenario Set"
)
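
For reference, each line of the uploaded JSONL file is a single scenario: a transcript (or excerpt) as the input, optionally paired with an expected result. Here is a small sketch of building such a file; the input/result field names follow the seed-data convention used in Okareo examples, but confirm them against the SDK docs, and the row content here is purely illustrative:

Python:
import json

# Illustrative rows; in practice the input is a full meeting transcript.
rows = [
    {
        "input": "...report with Councilman Mr. Brian Johnson. So want and what is in favor...",
        "result": "Council hears an overview of the legislation; Herbold and Burgess vote in opposition.",
    },
]

with open("example_data/seed_data.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")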

Using real-world examples is a great way to start. However, we strongly suggest generating synthetic scenarios with intentional variation in complexity to determine where the model/prompt will break. In this example, a few scenarios dedicated to the number of attendees, the number of action items, and overall meeting complexity would significantly improve coverage. Okareo can generate these scenarios from the same uploaded examples identified here or from other data sets such as real-world usage, PM requirements, etc.

Evaluation

Time to bring it all together. When developing a new model or editing a prompt, it is useful to have a way to verify that it is working locally. There are a number of ways to do this with Okareo. For this article, we are going to use one of the more direct methods, called flows. As part of the Okareo CLI, you can configure flows, which are named combinations of models and scenarios. A flow can then be run directly from the command line with `okareo run -f FLOW_NAME`. You can also use the same flow definition in GitHub Actions, CircleCI, GitLab, or any other bash-enabled CI provider. You can use an Okareo reporter to show the results in your shell, or review the results in the app through the app_link returned by the evaluation.
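
As a rough sketch of what a flow might assemble in Python, here is the whole loop in one place: upload the scenario, register the model/prompt, and run the checks. Treat the import paths, parameter names, and enum names below as approximations to be verified against the current SDK, not exact API:

Python:
import os
from okareo import Okareo
from okareo.model_under_test import OpenAIModel                  # import path assumed
from okareo_api_client.models.test_run_type import TestRunType   # import path assumed

okareo = Okareo(os.environ["OKAREO_API_KEY"])

# The persona/directive/format prompt from earlier in this post.
SUMMARY_PROMPT = open("prompts/meeting_summary.txt").read()  # hypothetical path

scenario = okareo.upload_scenario_set(
    file_path="example_data/seed_data.jsonl",
    scenario_name="City Council Transcripts",
)

model = okareo.register_model(
    name="meeting-summarizer",
    model=OpenAIModel(
        model_id="gpt-3.5-turbo",
        temperature=0,
        system_prompt_template=SUMMARY_PROMPT,
        user_prompt_template="{scenario_input}",  # template variable name assumed
    ),
)

evaluation = model.run_test(
    scenario=scenario,
    name="Transcript Summarization Run",
    api_key=os.environ["OPENAI_API_KEY"],
    test_run_type=TestRunType.NL_GENERATION,
    checks=[
        "summary_relevance",
        "summary_consistency",
        "summary.isJSON",
        "attendees.length",
        "actions.length",
        "summary.length",
        "summary.Under256",
    ],
)
print(evaluation.app_link)  # open the detailed results in the Okareo app

Saved as a flow file, the same script can then be executed locally with `okareo run -f FLOW_NAME` or from your CI provider.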

All Together Now

Okareo has a rich set of capabilities to help you evaluate complex model output. In this case, we looked at evaluating the summarization and metadata extracted from an in-person meeting transcript. Every LLM and model evaluation creates unique challenges specific to the desired behavior. As you think about your use case, start with the desired outcome you want from the model/prompt. Describe that outcome and build checks. Then assemble example scenarios that define the edges of what should and should not work. When all that is done, add the flow to CI, and you can be confident that future modifications to the prompt, or to the application using it, will be protected by your evaluation flow.

Building AI into applications is a wild ride. We get it. Let us know what your experiences are with evaluation. And don't hesitate to ask questions. We are here to help: support@okareo.com. In the meantime, Okareo is free to try. So, don't be shy. Sign up and give Okareo a spin.

