Evaluating OpenAI Assistants in CI

Agentics

Mason del Rosario, Founding Machine Learning Engineer

August 9, 2024

OpenAI Assistants let you quickly build agentic AI applications that can access relevant context with tools. With Okareo, you can evaluate your OpenAI Assistant and ensure that your AI application behaves as expected.

Why evaluate your Assistants?

Creating an Assistant involves crafting "directives" that guide the behaviors of the agent. These directives range from general guidelines like "Be friendly and helpful" to more specific instructions like "Keep all your responses to 100 words or less."

To get your Assistant behaving correctly, you will inevitably need to alter these directives. As you make such alterations, you will want to track how the Assistant's behavior changes so you can answer basic questions like, "Does my agent…"

  • "…follow the new directive that I just added?"

  • "…still follow the old directives that I already added?"

  • "…behave robustly to adversarial inputs?"

By using Okareo CI to run evaluations on your Assistant, you can answer these questions and track your Assistant's performance across each revision of your directives. In this blog post, we will show you how to set up an OpenAI Assistant and then teach you how to run an Okareo Generative Evaluation on that Assistant.

Note: Follow along with this post by checking out our assistant-evaluation-in-ci tutorial repo!

Setting up your OpenAI Assistant

If you already have an Assistant set up in OpenAI, then feel free to skip to the next section.

If you want to start from zero, then you can follow these instructions.

  1. Get an OpenAI API key, and use this key to set your local environment's OPENAI_API_KEY variable.

  2. Then, create your Assistant by running okareo run -f setup-assistant from the top-level directory of assistant-evaluation-in-ci. This flow file sets up a "WebBizz Analyst Assistant" with the following instructions (we sketch what the flow does under the hood just after the instructions):

You are a B2B sales associate for WebBizz, an online retail platform. You are responsible for generating leads for new corporate partnerships. Refer to the context outlined in the WebBizz White Paper, and keep the following instructions in mind when answering questions:

Instructions:

  • Be friendly and helpful.

  • Be brief. Keep all your responses to 100 words or less.

  • Do not talk about topics that are outside of your context. If the user asks you to discuss irrelevant topics, then nudge them towards discussing corporate partnerships with WebBizz.

  • Highlight the advantages for prospective partners of choosing WebBizz as their preferred sales or distribution platform.

  • Do not under any circumstances mention direct competitors, especially not Amazine, Demu, or Olli Bobo.

The Assistant is also provided with the WebBizz White Paper, which it can access via the file_search tool.
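We won't reproduce the setup flow here, but the sketch below shows roughly what it does under the hood, assuming the openai Node SDK: upload the white paper, add it to a vector store, and create an Assistant carrying the directives and the file_search tool. The file path, model choice, and names are illustrative rather than the exact values used in the repo.

// a minimal sketch of creating the Assistant, assuming the openai Node SDK;
// the file path, model, and names below are illustrative
import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function createWebBizzAssistant() {
  // upload the white paper so the Assistant can search it
  const file = await openai.files.create({
    file: fs.createReadStream("webbizz_white_paper.pdf"), // hypothetical path
    purpose: "assistants",
  });
  // place the file in a vector store for the file_search tool
  const vectorStore = await openai.beta.vectorStores.create({
    name: "WebBizz White Paper",
    file_ids: [file.id],
  });
  // create the Assistant with the directives shown above
  const assistant = await openai.beta.assistants.create({
    name: "WebBizz Analyst Assistant",
    instructions: "You are a B2B sales associate for WebBizz, ...", // full directives here
    model: "gpt-4o", // illustrative model choice
    tools: [{ type: "file_search" }],
    tool_resources: { file_search: { vector_store_ids: [vectorStore.id] } },
  });
  console.log(`Assistant created: ${assistant.id}`);
}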

Using the Assistant in Okareo

With the Assistant available via OpenAI's API, we can make calls to it with an Okareo CustomModel. In the openai-assistant-model.ts script, we set up an async function to call the Assistant:

// client for calling the OpenAI API
import OpenAI from "openai";
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// the OpenAI Assistant to evaluate in Okareo
const OPENAI_ASSISTANT_ID = process.env.OPENAI_ASSISTANT_ID || "<YOUR_OPENAI_ASSISTANT_ID>";

async function callOpenAIAssistantThread(
  userMessage: string,
  debug: boolean = false
): Promise<any> {
  // get the assistant object
  const assistant = await openai.beta.assistants.retrieve(OPENAI_ASSISTANT_ID);
  // create a thread to test the assistant
  const thread = await openai.beta.threads.create({
    messages: [
      { role: "user", content: userMessage },
    ],
  });
  if (debug) {
    console.log(`Thread created: ${thread.id}`);
  }
  // run the thread
  const output = await openai.beta.threads.runs.stream(thread.id, {
    assistant_id: assistant.id,
  }).finalMessages();
  return output; 
}

Then, we define the invoke method that we use with the CustomModel:

export async function invoke(input: string) {
  // time the Assistant call so latency can be reported in the metadata
  const time_started = new Date().getTime();
  const messages = await callOpenAIAssistantThread(input);
  const time_ended = new Date().getTime();
  return {
    // the text of the Assistant's final message
    model_prediction: messages[0].content[0].text.value,
    model_input: input,
    model_output_metadata: {
      time_started,
      time_ended,
      time_elapsed_sec: (time_ended - time_started) / 1000,
      full_response: messages,
    }
  };
}
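Before registering the model, it can help to smoke-test invoke locally. Here is a minimal sketch, assuming an ESM TypeScript runner such as tsx (the prompt is just an example):

// quick local smoke test for invoke; the prompt is illustrative
import { invoke } from "./openai-assistant-model";

const result = await invoke("What does WebBizz offer prospective partners?");
console.log(result.model_prediction);
console.log(`Elapsed: ${result.model_output_metadata.time_elapsed_sec} s`);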

Finally, we can register this model in Okareo, which we do using the Okareo TypeScript SDK. Note that we look up the project ID by name before registering:

import { CustomModel, Okareo } from "okareo-ts-sdk";
// the invoke method that we defined above
import { invoke } from "./openai-assistant-model";
// key for the Okareo API
const OKAREO_API_KEY = process.env.OKAREO_API_KEY || "<YOUR_OKAREO_KEY>";
// name of the Okareo project to register the model under
const PROJECT_NAME = "Global";
// instantiate the Okareo client
const okareo = new Okareo({ api_key: OKAREO_API_KEY });
// look up the project ID by name
const pData: any[] = await okareo.getProjects();
const project_id = pData.find(p => p.name === PROJECT_NAME)?.id;
// name for the Okareo CustomModel
const MODEL_NAME = "WebBizz B2B Analyst (OpenAI Assistant)";
const model_under_test = await okareo.register_model({
  name: MODEL_NAME,
  project_id: project_id,
  models: {
    type: "custom",
    invoke,
  } as CustomModel
});

If you have a scenario uploaded to Okareo and one or more checks in mind, then you can evaluate your Assistant with the run_test method:

import { RunTestProps, TestRunType } from "okareo-ts-sdk";

// the Assistant itself runs under your OpenAI key
const OPENAI_API_KEY = process.env.OPENAI_API_KEY;
const eval_run: any = await model_under_test.run_test({
  name: "My Assistant Evaluation",
  model_api_key: OPENAI_API_KEY,
  project_id: project_id,
  scenario_id: "<YOUR_SCENARIO_ID>",
  type: TestRunType.NL_GENERATION,
  checks: ["<CHECK_NAME_1>", /* ... */ "<CHECK_NAME_N>"],
} as RunTestProps);

We have defined a few Assistant evaluations for you to try out on the WebBizz Assistant. Just clone the assistant-evaluation-in-ci repo and run one of the following:

  • okareo run -f off-topic-eval

  • okareo run -f prompt-protection-eval

  • okareo run -f competitor-mentions-eval
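To run these evaluations in CI, trigger them whenever your directives change. Below is a rough sketch of a GitHub Actions workflow; it assumes the Okareo CLI is already available on the runner (see the Okareo docs for installation) and that your keys are stored as repository secrets. The workflow name and trigger are illustrative.

# illustrative GitHub Actions workflow; assumes the Okareo CLI is on the runner's PATH
name: assistant-evals
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    env:
      OKAREO_API_KEY: ${{ secrets.OKAREO_API_KEY }}
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      OPENAI_ASSISTANT_ID: ${{ secrets.OPENAI_ASSISTANT_ID }}
    steps:
      - uses: actions/checkout@v4
      - name: Run Assistant evaluations
        run: |
          okareo run -f off-topic-eval
          okareo run -f prompt-protection-eval
          okareo run -f competitor-mentions-eval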

Conclusion

In this post, we showed you how to set up an OpenAI Assistant with a file_search tool. Then, we registered the OpenAI Assistant as a model in Okareo. Finally, we showed you how to run an evaluation on the registered model.

In a series of future posts, we will dive deeper into the evaluations contained in the repo, including prompt protection, off-topic query handling, and competitor mentions.

