Function-calling Evals in LLMs and Agent Networks
Evaluation

Matt Wyman, CEO/Co-Founder
Sarah Barber, Senior Technical Writer
January 22, 2025
Function calling is now an essential feature of most LLMs, as it allows them to interact with the outside world. Giving your LLM-powered application function-calling capability lets it bridge the gap between natural language generation and practical tool use.
Function calling also adds an extra layer of complexity when it comes to testing your application. Because LLMs are non-deterministic, they generally already have to be evaluated separately from your application code.
Evaluating these LLM agents (or sometimes agent networks) when they're making external function calls may seem complicated, but in this article, we’ll explain everything that's involved with evaluating function calls (including in the context of agent networks) and show you how to do a function-calling eval using Okareo.
What is function calling?
Function calling is when an LLM is given the ability to generate instructions that trigger external tools or invoke specific code functions via API endpoints. The actual function calling is done either by your application code or by a specialized task-oriented agent in an agent network.
You can use function calls to enhance the response of your LLM application. For example, if an LLM detects that a user is asking for a weather forecast for New York City, instead of responding with generated text from its sandboxed environment, it can call a check_weather() function to find out the actual weather forecast for New York and use this to augment its response with real-world data, as well as more user-friendly information, such as graphs or nice formatting.
A function-calling LLM is made aware of all the possible functions it could call via a system prompt or as part of the API call setup. It's usually sent some JSON containing a list of function names, along with descriptions of what they do and the parameters they accept. For the check_weather() function, this could look like:
{
  "functions": [
    {
      "name": "check_weather",
      "description": "Fetches the current weather conditions or a short-term forecast for a specified location.",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "The name of the location to check the weather for (e.g., city, region)."
          },
          "units": {
            "type": "string",
            "description": "The unit system for the weather data. Options are 'metric' (Celsius) or 'imperial' (Fahrenheit).",
            "enum": ["metric", "imperial"]
          },
          "time_period": {
            "type": "string",
            "description": "The desired time period for weather information. Options are 'current' for now, 'hourly' for hourly forecast, or 'daily' for a short-term daily forecast.",
            "enum": ["current", "hourly", "daily"]
          }
        },
        "required": ["location", "time_period"]
      }
    }
  ]
}
Depending on your setup, the functions might be functions that you wrote yourself in your application code or ones written by a third-party system.
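To make the "API call setup" route concrete, here's a brief sketch of passing a schema like this to a model at request time using the OpenAI Python SDK's tools parameter. The model name and user message are illustrative placeholders, and other providers offer equivalent mechanisms.
# Sketch only: passing the check_weather schema to an OpenAI model via the
# "tools" parameter. Model name and prompt are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

check_weather_schema = {
    "name": "check_weather",
    "description": "Fetches the current weather conditions or a short-term forecast for a specified location.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "The name of the location to check the weather for."},
            "units": {"type": "string", "enum": ["metric", "imperial"]},
            "time_period": {"type": "string", "enum": ["current", "hourly", "daily"]},
        },
        "required": ["location", "time_period"],
    },
}

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in New York City?"}],
    tools=[{"type": "function", "function": check_weather_schema}],
)

# If the model decided to call the function, the call comes back as structured
# data (a name plus JSON-encoded arguments) rather than as free text.
message = response.choices[0].message
if message.tool_calls:
    tool_call = message.tool_calls[0]
    print(tool_call.function.name)                   # "check_weather"
    print(json.loads(tool_call.function.arguments))  # {"location": "New York City", ...}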
How function calling works
Function calling works differently depending on the complexity of your setup. First, we'll look at a simple example of a single LLM that does function calling and returns the response to your application, meaning your application code needs to handle everything. Later, we'll cover a more advanced agent network setup where AI handles more of this work.
For the simple example, when a user of your app asks "What's the weather in New York City?", this request goes to a function-calling LLM, which determines which function of all the functions available to it is the most relevant (based on the descriptions of each function). It then generates a function call, which includes the function name and parameters. This is typically JSON, and it gets returned to your application code.
{
  "name": "check_weather",
  "parameters": {
    "location": "New York City",
    "units": "imperial",
    "time_period": "current"
  }
}
Your application would need to parse this JSON and then call the check_weather() function, which you'd previously implemented in your code, with check_weather("New York City", "imperial", "current"). This function might call a third-party API to get a weather forecast and then generate a data visualization of the forecast, which could be returned and displayed to your user.
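As a minimal sketch (not from the article's codebase), parsing and dispatching that JSON in your application code could look something like the following. The function registry and the stubbed check_weather() implementation are assumptions for illustration.
import json

# Stub implementation; a real version would call a weather API and build a chart.
def check_weather(location, units="metric", time_period="current"):
    return {"location": location, "units": units, "time_period": time_period, "forecast": "Sunny, 75F"}

# Map the function names the LLM knows about to the actual Python functions.
AVAILABLE_FUNCTIONS = {"check_weather": check_weather}

def handle_function_call(raw_json: str):
    call = json.loads(raw_json)
    func = AVAILABLE_FUNCTIONS.get(call["name"])
    if func is None:
        raise ValueError(f"Model requested an unknown function: {call['name']}")
    return func(**call["parameters"])

result = handle_function_call(
    '{"name": "check_weather", "parameters": {"location": "New York City", "units": "imperial", "time_period": "current"}}'
)
# result now holds real-world data that your application can render for the user.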
A more common scenario when a user types a request for the weather is that they want a natural language response rather than raw data or graphs returned to the page. To make this work, you need a series of LLMs working together. This is known as an agent network. You can either build your own agent network or use an agent network framework like AutoGen or CrewAI.
The architecture of an agent network can vary, but a common way to do it has the user interact with a primary "orchestrator" agent, which acts as a manager that delegates tasks to other more specialized sub-agents. These sub-agents are typically LLMs or task-specific agents that can call functions.
In the example below, the orchestrator agent generates the check_weather() function call itself, since it's been made aware of this function as part of the system prompt. (Depending on the complexity of the prompts, you could instead have the orchestrator detect that the user is interested in weather and hand everything weather-related off to a dedicated weather sub-agent.) The orchestrator receives some weather data in JSON format, but instead of handing it back to the application, it hands this data to another LLM in the network, asking it to send back a natural language response along with a visual chart of hourly temperatures. Once this LLM has responded with natural language and an image, the response is returned to the application.

In an agent network, the function call is just one part of a larger puzzle. In this example, the orchestrator agent is responsible for generating the function call instructions, one of the agents is responsible for actually calling the function, and another agent creates user-friendly text based on the response.
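To make that division of labor concrete, here is a deliberately simplified, hypothetical sketch of the flow. The agent objects and their methods (generate_tool_call, execute, summarize) are illustrative placeholders, not the API of any real framework.
# Hypothetical orchestration flow; the agent interfaces are placeholders for illustration.
def answer_with_tools(user_message, orchestrator, tool_agent, summarizer):
    # 1. The orchestrator LLM decides which function to call and with what arguments.
    tool_call = orchestrator.generate_tool_call(user_message)   # e.g. {"name": "check_weather", ...}

    # 2. A task-specific agent (or plain application code) actually executes the call.
    weather_data = tool_agent.execute(tool_call)                # raw weather JSON

    # 3. Another LLM turns the raw data into a user-friendly answer (text plus a chart).
    return summarizer.summarize(user_message, weather_data)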
What are the problems with function calling?
The key problem when it comes to LLMs generating function-calling instructions is that you need to be sure that they're generating the correct function names. The more functions an LLM has to choose from, the harder this process becomes.
An LLM uses any context it has available to it to choose the correct function name. This includes user instructions, conversation history, the system prompt, and any instructions from the orchestrator or other agents in the network. Even if the agent generates the correct function name, it can still forget to pass some parameters to it or pass the wrong values or data types for each parameter.
Given all these things that can go wrong, you can't just assume that an LLM will generate function calls correctly every time. That's why you need function-calling evaluation.
What is function-calling evaluation?
Function-calling evaluation, as its name suggests, refers to evaluating everything around how good an LLM (or agent network) is at generating the correct function calls. This includes the function names, whether all parameters are present, and whether the parameters are the correct data types.
If you're dealing with a single LLM, you can analyze the JSON that it returns to check whether it's even generating a syntactically correct function call at all (with a function call name, parameters and data types), and then if it is, whether this includes the correct function name and parameters.
However, if you're using a network of agents, you'll need to analyze two different things:
The JSON response from the agent that generated the function call. This response is usually sent to another agent in the network — one that's responsible for planning or orchestration.
The overall natural language response from the agent network to the user (or your application).
Evaluating an LLM involves analyzing whether its responses show that it's doing the things you want it to do. This generally involves setting up some test cases of sample inputs and the responses they generate and comparing each response with a predetermined "gold standard" expected response. You might have a number of different behaviors that you care about, and each of these needs to be checked.
When evaluating function calls, you'll want to ask these questions as a bare minimum:
Is the system returning a syntactically correct function call at all (with a function call name, parameters, and data types)?
Does the function name match?
Do the parameter names, values, and data types match?
You can make your evaluations even more comprehensive by checking things like how your system handles edge cases such as misspellings or whether it can handle multiple or parallel function calls effectively.
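To make these questions concrete, here's a minimal, framework-agnostic sketch of what comparing a generated call against a gold-standard call might look like. The function and field names here are illustrative and not part of any particular eval library.
# Illustrative only: a bare-minimum comparison between a generated function call
# and a "gold standard" expected call. Field names are hypothetical.
def evaluate_call(generated: dict, expected: dict) -> dict:
    gen_params = generated.get("parameters", {})
    exp_params = expected.get("parameters", {})
    return {
        # Is there a syntactically plausible call at all?
        "is_valid_call": isinstance(generated.get("name"), str) and isinstance(gen_params, dict),
        # Does the function name match the expected one?
        "function_name_matches": generated.get("name") == expected.get("name"),
        # Are the parameter names exactly the ones we expect?
        "param_names_match": set(gen_params) == set(exp_params),
        # Do the values (and therefore their types) match?
        "param_values_match": all(gen_params.get(k) == v for k, v in exp_params.items()),
    }

# Example: comparing the weather call from earlier against itself passes every check.
expected = {"name": "check_weather", "parameters": {"location": "New York City", "units": "imperial", "time_period": "current"}}
generated = {"name": "check_weather", "parameters": {"location": "New York City", "units": "imperial", "time_period": "current"}}
print(evaluate_call(generated, expected))  # every value is True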
Why is function-calling eval important?
Function-calling evaluation is the best way to ensure that your system is doing a good job. It helps you determine whether the LLM (or agent network) is selecting the correct functions and passing valid parameters. Running regular evals also helps avoid regressions, where your system starts to perform worse after the model or system prompts are updated. Evaluations also allow you to be confident that your system understands the user's intent and that it can make use of context such as conversation history to make decisions.
If an eval fails, this highlights potential areas for improvement. For example, if the system isn't calling the correct function, this could be due to any of a number of issues in different parts of the system. A common issue is that the function schema isn't clear enough. A function-calling LLM is given a schema of all the different function calls that are available, including the names of the calls, a description of what they do, and the parameters they take.
A function-calling evaluation will tell you how good of a job your LLM is doing, and if it's doing a poor job, you can see specific use cases where it fails. For example, you could see which function is being called instead of the correct one. Reading the descriptions of each function may give you insight into why the wrong function was called. You can then update the descriptions of each function so they're less likely to be confused.
How to use Okareo for function-calling eval
Okareo is a tool for evaluating LLMs, and you can use it to evaluate LLMs that generate function calls as well. It consists of an app and either Python or TypeScript SDKs. Here we'll show how to evaluate functions by using an example of account creation and deletion functions for a website. The full working code for this example is hosted on our GitHub.
Start by defining the test cases for your evaluation, which are called scenarios in Okareo. These are a series of example user inputs paired with a "gold standard" result that you define. In this example, we're focusing on evaluating the function call generation ability. To do this, you need to create some scenarios with function calls inside the result fields. A function call typically looks like JSON with the function call name and parameters in it. You need to ensure that you use the exact same JSON structure that your system expects when it calls the functions.
scenario_data = [
    SeedData(
        input_="can you delete my account? my name is Bob",
        result={"function": {"name": "delete_account", "arguments": {"username": "Bob"}, "__required": ["username"]}},
    ),
    SeedData(
        input_="can you delete my account? my name is john",
        result={"function": {"name": "delete_account", "arguments": {"username": "John"}, "__required": ["username"]}},
    ),
    SeedData(
        input_="how do I make an account? my name is Alice",
        result={"function": {"name": "create_account", "arguments": {"username": "Alice"}, "__required": ["username"]}},
    ),
    SeedData(
        input_="how do I create an account?",
        result={"function": {"name": "create_account", "arguments": {"username": ".+"}, "__required": ["username"]}},
    ),
    SeedData(
        input_="my name is steve. how do I create a project?",
        result={"function": {"name": "create_account", "arguments": {"username": "Steve"}, "__required": ["username"]}},
    ),
]

tool_scenario = okareo.create_scenario_set(
    ScenarioSetCreate(
        name=f"Function Call Demo Scenario - {random_string(5)}",
        seed_data=scenario_data,
    )
)
Once you've created your scenario set, you need to register your model with Okareo. Okareo works with a variety of third-party models such as OpenAI, and these are very simple to register since you typically just need your model’s ID. However, for this example we're going to use a custom “model” (actually a fully deterministic script) as a toy example so we have full control over the output. This makes it easier to understand the evaluation process.
# create the custom model
class FunctionCallModel(CustomModel):
    def __init__(self, name):
        super().__init__(name)
        self.pattern = r'my name is (\S+)'

    def invoke(self, input_value):
        out = {"tool_calls": []}
        function_call = {"name": "unknown"}
        # parse out the function name
        if "delete" in input_value:
            function_call["name"] = "delete_account"
        if "create" in input_value:
            function_call["name"] = "create_account"
        # parse out the function argument
        match = re.search(self.pattern, input_value)
        if match:
            username = match.group(1)
            function_call["arguments"] = {"username": username}
        tool_call = {
            "function": function_call
        }
        # package the tool call and return
        out["tool_calls"].append(tool_call)
        return ModelInvocation(
            model_input=input_value,
            tool_calls=out["tool_calls"]
        )

# Register the model
model_under_test = okareo.register_model(
    name="Fake model that simulates function calling",
    model=[FunctionCallModel(name=FunctionCallModel.__name__)],
    update=True
)
Finally, you can run an evaluation on your model, passing in a series of named checks. Okareo has created a number of out-of-the-box checks that you can use by passing in their names, including some that are specifically for evaluating function calling, but you can also create your own custom checks. The checks listed below are built into Okareo.
evaluation = model_under_test.run_test(
    name="Function call evaluation",
    scenario=tool_scenario.scenario_id,
    test_run_type=TestRunType.NL_GENERATION,
    checks=["function_call_validator",
            "is_function_correct",
            "are_all_params_expected",
            "are_required_params_present",
            "do_param_values_match"]
)
Below is an explanation of what each function-call check does:
function_call_validator: checks that the output contains a syntactically valid function call (a function name plus a parameters object).
is_function_correct: checks that the generated function name matches the expected function name from the scenario.
are_all_params_expected: checks that the generated call doesn't include any unexpected parameters.
are_required_params_present: checks that all required parameters (such as username in our example) are present in the generated call.
do_param_values_match: checks that the generated parameter values match the expected values.
To run your Okareo code, you'll need to have Okareo installed locally and run the okareo run command. For more information on this, follow the instructions in the README.
Once you've run an Okareo evaluation, you can view the results inside the app in the Evaluations tab, or you can have your code print out a direct link to your evaluation's results page like so:
print(f"See results in Okareo: {evaluation.app_link}")
When viewing the results page, you can see an overview of how well your LLM (or, in this case, our pretend model script) has performed on each check, as well as specific results for each individual scenario. You don't have to view these results visually; you can also choose to read them programmatically. For example, you can set up automations to run regular evaluations as part of your CI workflow and inform you of any regressions.

Clicking on each individual row allows you to view the input, expected result, and tools invoked, which can help you debug why there was a failure.

In the screen capture above, you can see that the second row failed because no name was given in the input. You can use this information to make your system more robust. If you don't want the function call to fail in this scenario, you could make the username parameter optional and have the function supply a default value, such as "New User123", when none is given. The user could then be prompted to provide a real name when they first log into their account.
Our example scenario set contains a mixture of scenarios, some of which will fail different checks, allowing you to see some failures in action and understand how these failures can help you improve your system. For example, you can see that rows 1 and 4 have both failed the do_param_values_match check (because the generated name "john" is lowercase and "steve." is lowercase with a trailing period, while the expected values are "John" and "Steve"). Seeing this, you might choose to normalize usernames, for example by standardizing capitalization and stripping trailing periods or whitespace.
You can also see that row 3 has failed the is_function_correct check because the user used the term "make an account" instead of "create an account." In this case, updating the description of the function call in the schema or adding more information to the system prompt might help the LLM understand that these phrases are equivalent.
Use Okareo's function-calling eval alongside evaluation of all your LLM's behaviors
Function-calling evaluation is just one aspect of the evaluations you need to do for your LLMs. You can use Okareo to evaluate the output of any LLM, whether that's natural language, JSON, or code. It can also evaluate multi-turn conversations between a user and an agent and between agents within an agent network.
You can try out many of these features by installing and signing up for Okareo today.
Function calling is now an essential feature of most LLMs, as it allows them to interact with the outside world. Adding function-calling capability to your LLM-powered application allows it to bridge the gap between natural language generation and practical tool use.
Adding an LLM with function-calling capabilities to your application adds an extra layer of complication when it comes to testing it. As LLMs are non-deterministic, they generally already have to be evaluated separately from your application code.
Evaluating these LLM agents (or sometimes agent networks) when they're making external function calls may seem complicated, but in this article, we’ll explain everything that's involved with evaluating function calls (including in the context of agent networks) and show you how to do a function-calling eval using Okareo.
What is function calling?
Function calling is when an LLM is given the ability to generate instructions that trigger external tools or invoke specific code functions via API endpoints. The actual function calling is done either by your application code or by a specialized task-oriented agent in an agent network.
You can use function calls to enhance the response of your LLM application. For example, if an LLM detects that a user is asking for a weather forecast for New York City, instead of responding with generated text from its sandboxed environment, it can call a check_weather()
function to find out the actual weather forecast for New York and use this to augment its response with real-world data, as well as more user-friendly information, such as graphs or nice formatting.
A function-calling LLM is made aware of all the possible functions it could call via a system prompt or as part of the API call setup. It's usually sent some JSON, containing a list of function names, along with descriptions of what they do and the parameters they accept. For the check_weather()
function, this could look like:
{
"functions": [
{
"name": "check_weather",
"description": "Fetches the current weather conditions or a short-term forecast for a specified location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The name of the location to check the weather for (e.g., city, region)."
},
"units": {
"type": "string",
"description": "The unit system for the weather data. Options are 'metric' (Celsius) or 'imperial' (Fahrenheit).",
"enum": ["metric", "imperial"]
},
"time_period": {
"type": "string",
"description": "The desired time period for weather information. Options are 'current' for now, 'hourly' for hourly forecast, or 'daily' for a short-term daily forecast.",
"enum": ["current", "hourly", "daily"]
}
},
"required": ["location", "time_period"]
}
}
]
}
Depending on your setup, the functions might be functions that you wrote yourself in your application code or ones written by a third-party system.
How function calling works
Function calling works differently depending on the complexity of your setup. First, we'll look at a simple example of a single LLM that does function calling and returns the response to your application, meaning your application code needs to handle everything. Later, we'll cover a more advanced agent network setup where AI handles more of this work.
For the simple example, when a user of your app asks "What's the weather in New York City?", this request goes to a function-calling LLM, which determines which function of all the functions available to it is the most relevant (based on the descriptions of each function). It then generates a function call, which includes the function name and parameters. This is typically JSON, and it gets returned to your application code.
{
"name": "check_weather",
"parameters": {
"location": "New York City",
"units": "imperial",
"time_period": "current"
}
}
Your application would need to parse this JSON and then call the check_weather()
function, which you'd previously implemented in your code, with check_weather("New York City", "imperial", "current")
. This function might call a third-party API to get a weather forecast and then generate a data visualization of the forecast, which could be returned and displayed to your user.
A more common scenario (when a user types a request for the weather) is that they would want a natural language response rather than a series of data or graphs returned to the page. To make this work, you need a series of LLMs working together. This is known as an agent network. You can either build your own agent network or use an agent network framework like AutoGen or CrewAI.
The architecture of an agent network can vary, but a common way to do it has the user interact with a primary "orchestrator" agent, which acts as a manager that delegates tasks to other more specialized sub-agents. These sub-agents are typically LLMs or task-specific agents that can call functions.
In the example below, the orchestrator agent has the ability to generate function calls itself; however, depending on the prompt’s complexity, you could have the orchestrator detect that the user is interested in weather and hand everything related to this off to a weather sub-agent. In this case, the orchestrator is able to generate the check_weather()
function call since it's been made aware of this function as part of the system prompt. The orchestrator receives some weather data in JSON format, but instead of handing it back to the application, it hands this data to another LLM in the network, asking it to send back a natural language response along with a visual chart of hourly temperatures. Once this LLM has responded with natural language and an image, the response is returned to the application.

In an agent network, the function call is just one part of a larger puzzle. In this example, the orchestrator agent is responsible for generating the function call instructions, one of the agents is responsible for actually calling the function, and another agent creates user-friendly text based on the response.
What are the problems with function calling?
The key problem when it comes to LLMs generating function-calling instructions is that you need to be sure that they're generating the correct function names. The more functions an LLM has to choose from, the harder this process becomes.
An LLM uses any context it has available to it to choose the correct function name. This includes user instructions, conversation history, the system prompt, and any instructions from the orchestrator or other agents in the network. Even if the agent generates the correct function name, it can still forget to pass some parameters to it or pass the wrong values or data types for each parameter.
Given all these things that can go wrong, you can't just assume that an LLM will generate function calls correctly every time. That's why you need function-calling evaluation.
What is function-calling evaluation?
Function-calling evaluation, as its name suggests, refers to evaluating everything around how good an LLM (or agent network) is at generating the correct function calls. This includes the function names, whether all parameters are present, and whether the parameters are the correct data types.
If you're dealing with a single LLM, you can analyze the JSON that it returns to check whether it's even generating a syntactically correct function call at all (with a function call name, parameters and data types), and then if it is, whether this includes the correct function name and parameters.
However, if you're using a network of agents, you'll need to analyze two different things:
The JSON response from the agent that generated the function call. This response is usually sent to another agent in the network — one that's responsible for planning or orchestration.
The overall natural language response from the agent network to the user (or your application).
Evaluating an LLM involves analyzing whether its responses show that it's doing the things you want it to do. This generally involves setting up some test cases of sample inputs and the responses they generate and comparing each response with a predetermined "gold standard" expected response. You might have a number of different behaviors that you care about, and each of these need to be checked.
When evaluating function calls, you'll want to ask these questions as a bare minimum:
Is the system returning a syntactically correct function call at all (with a function call name, parameters, and data types)?
Does the function name match?
Do the parameter names, values, and data types match?
You can make your evaluations even more comprehensive by checking things like how your system handles edge cases such as misspellings or whether it can handle multiple or parallel function calls effectively.
Why is function-calling eval important?
Function-calling evaluation is the best way to ensure that your system is doing a good job. It helps you determine whether the LLM (or agent network) is selecting the correct functions and passing valid parameters. Running regular evals also helps avoid regressions, where your system starts to perform worse after the model or system prompts are updated. Evaluations also allow you to be confident that your system understands the user's intent and that it can make use of context such as conversation history to make decisions.
If an eval fails, this highlights potential areas for improvement. For example, if the system isn't calling the correct function, this could be because of a number of possible issues with different parts of the system. A common issue is often that the function schema isn't clear enough. A function-calling LLM is given a schema of all the different function calls that are available, including the names of the calls, a description of what they do, and the parameters they take.
A function-calling evaluation will tell you how good of a job your LLM is doing, and if it's doing a poor job, you can see specific use cases where it fails. For example, you could see which function is being called instead of the correct one. Reading the descriptions of each function may give you insight into why the wrong function was called. You can then update the descriptions of each function so they're less likely to be confused.
How to use Okareo for function-calling eval
Okareo is a tool for evaluating LLMs, and you can use it to evaluate LLMs that generate function calls as well. It consists of an app and either Python or TypeScript SDKs. Here we'll show how to evaluate functions by using an example of account creation and deletion functions for a website. The full working code for this example is hosted on our GitHub.
Start by defining the test cases for your evaluation, which are called scenarios in Okareo. These are a series of example user inputs paired with a "gold standard" result that you define. In this example, we're focusing on evaluating the function call generation ability. To do this, you need to create some scenarios with function calls inside the result fields. A function call typically looks like JSON with the function call name and parameters in it. You need to ensure that you use the exact same JSON structure that your system expects when it calls the functions.
scenario_data = [
SeedData(
input_="can you delete my account? my name is Bob",
result="function": {{"name": "delete_account", "arguments": { "username": "Bob" }, "__required": ["username"]}},
),
SeedData(
input_="can you delete my account? my name is john",
result="function": {{"name": "delete_account", "arguments": { "username": "John" }, "__required": ["username"]}},
),
SeedData(
input_="how do I make an account? my name is Alice",
result="function": {{"name": "create_account", "arguments": { "username": "Alice"}, "__required": ["username"]}},
),
SeedData(
input_="how do I create an account?",
result="function": {{"name": "create_account", "arguments": { "username": ".+" }, "__required": ["username"]}},
),
SeedData(
input_="my name is steve. how do I create a project?",
result="function": {{"name": "create_account", "arguments": { "username": "Steve" }, "__required": ["username"]}},
),
]
tool_scenario = okareo.create_scenario_set(
ScenarioSetCreate(
name=f"Function Call Demo Scenario - {random_string(5)}",
seed_data=scenario_data,
)
)
Once you've created your scenario set, you need to register your model with Okareo. Okareo works with a variety of third-party models such as OpenAI, and these are very simple to register since you typically just need your model’s ID. However, for this example we're going to use a custom “model” (actually a fully deterministic script) as a toy example so we have full control over the output. This makes it easier to understand the evaluation process.
# create the custom model
class FunctionCallModel(CustomModel):
def __init__(self, name):
super().__init__(name)
self.pattern = r'my name is (\S+)'
def invoke(self, input_value):
out = {"tool_calls": []}
function_call = {"name": "unknown"}
# parse out the function name
if "delete" in input_value:
function_call["name"] = "delete_account"
if "create" in input_value:
function_call["name"] = "create_account"
# parse out the function argument
match = re.search(self.pattern, input_value)
if match:
username = match.group(1)
function_call["arguments"] = {"username": username}
tool_call = {
"function": function_call
}
# package the tool call and return
out["tool_calls"].append(tool_call)
return ModelInvocation(
model_input=input_value,
tool_calls=out["tool_calls"]
)
# Register the model
model_under_test = okareo.register_model(
name="Fake model that simulates function calling",
model=[FunctionCallModel(name=FunctionCallModel.__name__)],
update=True
)
Finally, you can run an evaluation on your model, passing in a series of named checks. Okareo has created a number of out-of-the-box checks that you can use by passing in their names, including some that are specifically for evaluating function calling, but you can also create your own custom checks. The checks listed below are built into Okareo.
evaluation = model_under_test.run_test(
name="Function call evaluation",
scenario=tool_scenario.scenario_id,
test_run_type=TestRunType.NL_GENERATION,
checks=["function_call_validator",
"is_function_correct",
"are_all_params_expected",
"are_required_params_present",
"do_param_values_match"]
)
Below is an explanation of what each function call check does:

To run your Okareo code, you'll need to have Okareo installed locally and run the okareo run command. For more information on this, follow the instructions in the README.
Once you've run an Okareo evaluation, you can view the results inside the app in the Evaluations tab, or you can have your code print out a direct link to your evaluation's results page like so:
print(f"See results in Okareo: {evaluation.app_link}")
When viewing the results page, you can see an overview of how well your LLM (or, in this case, our pretend model script) has performed on each check, as well as specific results for each individual scenario. You don't have to view these results visually; you can also choose to read them programmatically. For example, you can set up automations to run regular evaluations as part of your CI workflow and inform you of any regressions.

Clicking on each individual row allows you to view the input, expected result, and tools invoked, which can help you debug why there was a failure.

In the screen capture above, you can see that the second row failed because no name was given in the input. You can use this information to make your system more robust. If you don't want the function call to fail in this scenario, you could make the username parameter optional and have the function supply a default parameter in cases where there is no value, such as "New User123." The user can then later be forced to give a real name when they first log into their account.
Our example scenario set contains a mixture of scenarios, some of which will fail different checks, allowing you to see some failures in action and understand how these failures will help you improve your system. For example, you can see that rows 1 and 4 have both failed the do_param_values_match
check (because the name "john" is lowercase and the name "steve." is lowercase and has a period at the end). Seeing this, you might choose to convert all names to lowercase and remove trailing periods or whitespace).
You can also see that row 3 has failed the is_function_correct check
because the user used the term "make an account" instead of "create an account." In this case, updating the description of the function call in the schema or adding more information to the system prompt might help the LLM understand that these phrases are equivalent.
Use Okareo's function-calling eval alongside evaluation of all your LLM's behaviors
Function-calling evaluation is just one aspect of the evaluations you need to do for your LLMs. You can use Okareo to evaluate the output of any LLM, whether that's natural language, JSON, or code. It can also evaluate multi-turn conversations between a user and an agent and between agents within an agent network.
You can try out many of these features by installing and signing up for Okareo today.
Function calling is now an essential feature of most LLMs, as it allows them to interact with the outside world. Adding function-calling capability to your LLM-powered application allows it to bridge the gap between natural language generation and practical tool use.
Adding an LLM with function-calling capabilities to your application adds an extra layer of complication when it comes to testing it. As LLMs are non-deterministic, they generally already have to be evaluated separately from your application code.
Evaluating these LLM agents (or sometimes agent networks) when they're making external function calls may seem complicated, but in this article, we’ll explain everything that's involved with evaluating function calls (including in the context of agent networks) and show you how to do a function-calling eval using Okareo.
What is function calling?
Function calling is when an LLM is given the ability to generate instructions that trigger external tools or invoke specific code functions via API endpoints. The actual function calling is done either by your application code or by a specialized task-oriented agent in an agent network.
You can use function calls to enhance the response of your LLM application. For example, if an LLM detects that a user is asking for a weather forecast for New York City, instead of responding with generated text from its sandboxed environment, it can call a check_weather()
function to find out the actual weather forecast for New York and use this to augment its response with real-world data, as well as more user-friendly information, such as graphs or nice formatting.
A function-calling LLM is made aware of all the possible functions it could call via a system prompt or as part of the API call setup. It's usually sent some JSON, containing a list of function names, along with descriptions of what they do and the parameters they accept. For the check_weather()
function, this could look like:
{
"functions": [
{
"name": "check_weather",
"description": "Fetches the current weather conditions or a short-term forecast for a specified location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The name of the location to check the weather for (e.g., city, region)."
},
"units": {
"type": "string",
"description": "The unit system for the weather data. Options are 'metric' (Celsius) or 'imperial' (Fahrenheit).",
"enum": ["metric", "imperial"]
},
"time_period": {
"type": "string",
"description": "The desired time period for weather information. Options are 'current' for now, 'hourly' for hourly forecast, or 'daily' for a short-term daily forecast.",
"enum": ["current", "hourly", "daily"]
}
},
"required": ["location", "time_period"]
}
}
]
}
Depending on your setup, the functions might be functions that you wrote yourself in your application code or ones written by a third-party system.
How function calling works
Function calling works differently depending on the complexity of your setup. First, we'll look at a simple example of a single LLM that does function calling and returns the response to your application, meaning your application code needs to handle everything. Later, we'll cover a more advanced agent network setup where AI handles more of this work.
For the simple example, when a user of your app asks "What's the weather in New York City?", this request goes to a function-calling LLM, which determines which function of all the functions available to it is the most relevant (based on the descriptions of each function). It then generates a function call, which includes the function name and parameters. This is typically JSON, and it gets returned to your application code.
{
"name": "check_weather",
"parameters": {
"location": "New York City",
"units": "imperial",
"time_period": "current"
}
}
Your application would need to parse this JSON and then call the check_weather()
function, which you'd previously implemented in your code, with check_weather("New York City", "imperial", "current")
. This function might call a third-party API to get a weather forecast and then generate a data visualization of the forecast, which could be returned and displayed to your user.
A more common scenario (when a user types a request for the weather) is that they would want a natural language response rather than a series of data or graphs returned to the page. To make this work, you need a series of LLMs working together. This is known as an agent network. You can either build your own agent network or use an agent network framework like AutoGen or CrewAI.
The architecture of an agent network can vary, but a common way to do it has the user interact with a primary "orchestrator" agent, which acts as a manager that delegates tasks to other more specialized sub-agents. These sub-agents are typically LLMs or task-specific agents that can call functions.
In the example below, the orchestrator agent has the ability to generate function calls itself; however, depending on the prompt’s complexity, you could have the orchestrator detect that the user is interested in weather and hand everything related to this off to a weather sub-agent. In this case, the orchestrator is able to generate the check_weather()
function call since it's been made aware of this function as part of the system prompt. The orchestrator receives some weather data in JSON format, but instead of handing it back to the application, it hands this data to another LLM in the network, asking it to send back a natural language response along with a visual chart of hourly temperatures. Once this LLM has responded with natural language and an image, the response is returned to the application.

In an agent network, the function call is just one part of a larger puzzle. In this example, the orchestrator agent is responsible for generating the function call instructions, one of the agents is responsible for actually calling the function, and another agent creates user-friendly text based on the response.
What are the problems with function calling?
The key problem when it comes to LLMs generating function-calling instructions is that you need to be sure that they're generating the correct function names. The more functions an LLM has to choose from, the harder this process becomes.
An LLM uses any context it has available to it to choose the correct function name. This includes user instructions, conversation history, the system prompt, and any instructions from the orchestrator or other agents in the network. Even if the agent generates the correct function name, it can still forget to pass some parameters to it or pass the wrong values or data types for each parameter.
Given all these things that can go wrong, you can't just assume that an LLM will generate function calls correctly every time. That's why you need function-calling evaluation.
What is function-calling evaluation?
Function-calling evaluation, as its name suggests, refers to evaluating everything around how good an LLM (or agent network) is at generating the correct function calls. This includes the function names, whether all parameters are present, and whether the parameters are the correct data types.
If you're dealing with a single LLM, you can analyze the JSON that it returns to check whether it's even generating a syntactically correct function call at all (with a function call name, parameters and data types), and then if it is, whether this includes the correct function name and parameters.
However, if you're using a network of agents, you'll need to analyze two different things:
The JSON response from the agent that generated the function call. This response is usually sent to another agent in the network — one that's responsible for planning or orchestration.
The overall natural language response from the agent network to the user (or your application).
Evaluating an LLM involves analyzing whether its responses show that it's doing the things you want it to do. This generally involves setting up some test cases of sample inputs and the responses they generate and comparing each response with a predetermined "gold standard" expected response. You might have a number of different behaviors that you care about, and each of these need to be checked.
When evaluating function calls, you'll want to ask these questions as a bare minimum:
Is the system returning a syntactically correct function call at all (with a function call name, parameters, and data types)?
Does the function name match?
Do the parameter names, values, and data types match?
You can make your evaluations even more comprehensive by checking things like how your system handles edge cases such as misspellings or whether it can handle multiple or parallel function calls effectively.
Why is function-calling eval important?
Function-calling evaluation is the best way to ensure that your system is doing a good job. It helps you determine whether the LLM (or agent network) is selecting the correct functions and passing valid parameters. Running regular evals also helps avoid regressions, where your system starts to perform worse after the model or system prompts are updated. Evaluations also allow you to be confident that your system understands the user's intent and that it can make use of context such as conversation history to make decisions.
If an eval fails, this highlights potential areas for improvement. For example, if the system isn't calling the correct function, this could be because of a number of possible issues with different parts of the system. A common issue is often that the function schema isn't clear enough. A function-calling LLM is given a schema of all the different function calls that are available, including the names of the calls, a description of what they do, and the parameters they take.
A function-calling evaluation will tell you how good of a job your LLM is doing, and if it's doing a poor job, you can see specific use cases where it fails. For example, you could see which function is being called instead of the correct one. Reading the descriptions of each function may give you insight into why the wrong function was called. You can then update the descriptions of each function so they're less likely to be confused.
How to use Okareo for function-calling eval
Okareo is a tool for evaluating LLMs, and you can use it to evaluate LLMs that generate function calls as well. It consists of an app and either Python or TypeScript SDKs. Here we'll show how to evaluate functions by using an example of account creation and deletion functions for a website. The full working code for this example is hosted on our GitHub.
Start by defining the test cases for your evaluation, which are called scenarios in Okareo. These are a series of example user inputs paired with a "gold standard" result that you define. In this example, we're focusing on evaluating the function call generation ability. To do this, you need to create some scenarios with function calls inside the result fields. A function call typically looks like JSON with the function call name and parameters in it. You need to ensure that you use the exact same JSON structure that your system expects when it calls the functions.
scenario_data = [
SeedData(
input_="can you delete my account? my name is Bob",
result="function": {{"name": "delete_account", "arguments": { "username": "Bob" }, "__required": ["username"]}},
),
SeedData(
input_="can you delete my account? my name is john",
result="function": {{"name": "delete_account", "arguments": { "username": "John" }, "__required": ["username"]}},
),
SeedData(
input_="how do I make an account? my name is Alice",
result="function": {{"name": "create_account", "arguments": { "username": "Alice"}, "__required": ["username"]}},
),
SeedData(
input_="how do I create an account?",
result="function": {{"name": "create_account", "arguments": { "username": ".+" }, "__required": ["username"]}},
),
SeedData(
input_="my name is steve. how do I create a project?",
result="function": {{"name": "create_account", "arguments": { "username": "Steve" }, "__required": ["username"]}},
),
]
tool_scenario = okareo.create_scenario_set(
ScenarioSetCreate(
name=f"Function Call Demo Scenario - {random_string(5)}",
seed_data=scenario_data,
)
)
Once you've created your scenario set, you need to register your model with Okareo. Okareo works with a variety of third-party models such as OpenAI, and these are very simple to register since you typically just need your model’s ID. However, for this example we're going to use a custom “model” (actually a fully deterministic script) as a toy example so we have full control over the output. This makes it easier to understand the evaluation process.
# create the custom model
class FunctionCallModel(CustomModel):
def __init__(self, name):
super().__init__(name)
self.pattern = r'my name is (\S+)'
def invoke(self, input_value):
out = {"tool_calls": []}
function_call = {"name": "unknown"}
# parse out the function name
if "delete" in input_value:
function_call["name"] = "delete_account"
if "create" in input_value:
function_call["name"] = "create_account"
# parse out the function argument
match = re.search(self.pattern, input_value)
if match:
username = match.group(1)
function_call["arguments"] = {"username": username}
tool_call = {
"function": function_call
}
# package the tool call and return
out["tool_calls"].append(tool_call)
return ModelInvocation(
model_input=input_value,
tool_calls=out["tool_calls"]
)
# Register the model
model_under_test = okareo.register_model(
name="Fake model that simulates function calling",
model=[FunctionCallModel(name=FunctionCallModel.__name__)],
update=True
)
Finally, you can run an evaluation on your model, passing in a series of named checks. Okareo has created a number of out-of-the-box checks that you can use by passing in their names, including some that are specifically for evaluating function calling, but you can also create your own custom checks. The checks listed below are built into Okareo.
evaluation = model_under_test.run_test(
name="Function call evaluation",
scenario=tool_scenario.scenario_id,
test_run_type=TestRunType.NL_GENERATION,
checks=["function_call_validator",
"is_function_correct",
"are_all_params_expected",
"are_required_params_present",
"do_param_values_match"]
)
Below is an explanation of what each function call check does:

To run your Okareo code, you'll need to have Okareo installed locally and run the okareo run command. For more information on this, follow the instructions in the README.
Once you've run an Okareo evaluation, you can view the results inside the app in the Evaluations tab, or you can have your code print out a direct link to your evaluation's results page like so:
print(f"See results in Okareo: {evaluation.app_link}")
When viewing the results page, you can see an overview of how well your LLM (or, in this case, our pretend model script) has performed on each check, as well as specific results for each individual scenario. You don't have to view these results visually; you can also choose to read them programmatically. For example, you can set up automations to run regular evaluations as part of your CI workflow and inform you of any regressions.

Clicking on each individual row allows you to view the input, expected result, and tools invoked, which can help you debug why there was a failure.

In the screen capture above, you can see that the second row failed because no name was given in the input. You can use this information to make your system more robust. If you don't want the function call to fail in this scenario, you could make the username parameter optional and have the function supply a default parameter in cases where there is no value, such as "New User123." The user can then later be forced to give a real name when they first log into their account.
Our example scenario set contains a mixture of scenarios, some of which will fail different checks, allowing you to see some failures in action and understand how these failures will help you improve your system. For example, you can see that rows 1 and 4 have both failed the do_param_values_match
check (because the name "john" is lowercase and the name "steve." is lowercase and has a period at the end). Seeing this, you might choose to convert all names to lowercase and remove trailing periods or whitespace).
You can also see that row 3 has failed the is_function_correct check
because the user used the term "make an account" instead of "create an account." In this case, updating the description of the function call in the schema or adding more information to the system prompt might help the LLM understand that these phrases are equivalent.
Use Okareo's function-calling eval alongside evaluation of all your LLM's behaviors
Function-calling evaluation is just one aspect of the evaluations you need to do for your LLMs. You can use Okareo to evaluate the output of any LLM, whether that's natural language, JSON, or code. It can also evaluate multi-turn conversations between a user and an agent and between agents within an agent network.
You can try out many of these features by installing and signing up for Okareo today.
Function calling is now an essential feature of most LLMs, as it allows them to interact with the outside world. Adding function-calling capability to your LLM-powered application allows it to bridge the gap between natural language generation and practical tool use.
Adding an LLM with function-calling capabilities to your application adds an extra layer of complication when it comes to testing it. As LLMs are non-deterministic, they generally already have to be evaluated separately from your application code.
Evaluating these LLM agents (or sometimes agent networks) when they're making external function calls may seem complicated, but in this article, we’ll explain everything that's involved with evaluating function calls (including in the context of agent networks) and show you how to do a function-calling eval using Okareo.
What is function calling?
Function calling is when an LLM is given the ability to generate instructions that trigger external tools or invoke specific code functions via API endpoints. The actual function calling is done either by your application code or by a specialized task-oriented agent in an agent network.
You can use function calls to enhance the response of your LLM application. For example, if an LLM detects that a user is asking for a weather forecast for New York City, instead of responding with generated text from its sandboxed environment, it can call a check_weather()
function to find out the actual weather forecast for New York and use this to augment its response with real-world data, as well as more user-friendly information, such as graphs or nice formatting.
A function-calling LLM is made aware of all the possible functions it could call via a system prompt or as part of the API call setup. It's usually sent some JSON, containing a list of function names, along with descriptions of what they do and the parameters they accept. For the check_weather()
function, this could look like:
{
"functions": [
{
"name": "check_weather",
"description": "Fetches the current weather conditions or a short-term forecast for a specified location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The name of the location to check the weather for (e.g., city, region)."
},
"units": {
"type": "string",
"description": "The unit system for the weather data. Options are 'metric' (Celsius) or 'imperial' (Fahrenheit).",
"enum": ["metric", "imperial"]
},
"time_period": {
"type": "string",
"description": "The desired time period for weather information. Options are 'current' for now, 'hourly' for hourly forecast, or 'daily' for a short-term daily forecast.",
"enum": ["current", "hourly", "daily"]
}
},
"required": ["location", "time_period"]
}
}
]
}
Depending on your setup, the functions might be functions that you wrote yourself in your application code or ones written by a third-party system.
How function calling works
Function calling works differently depending on the complexity of your setup. First, we'll look at a simple example of a single LLM that does function calling and returns the response to your application, meaning your application code needs to handle everything. Later, we'll cover a more advanced agent network setup where AI handles more of this work.
For the simple example, when a user of your app asks "What's the weather in New York City?", this request goes to a function-calling LLM, which determines which function of all the functions available to it is the most relevant (based on the descriptions of each function). It then generates a function call, which includes the function name and parameters. This is typically JSON, and it gets returned to your application code.
{
"name": "check_weather",
"parameters": {
"location": "New York City",
"units": "imperial",
"time_period": "current"
}
}
Your application would need to parse this JSON and then call the check_weather()
function, which you'd previously implemented in your code, with check_weather("New York City", "imperial", "current")
. This function might call a third-party API to get a weather forecast and then generate a data visualization of the forecast, which could be returned and displayed to your user.
A more common scenario (when a user types a request for the weather) is that they would want a natural language response rather than a series of data or graphs returned to the page. To make this work, you need a series of LLMs working together. This is known as an agent network. You can either build your own agent network or use an agent network framework like AutoGen or CrewAI.
The architecture of an agent network can vary, but a common way to do it has the user interact with a primary "orchestrator" agent, which acts as a manager that delegates tasks to other more specialized sub-agents. These sub-agents are typically LLMs or task-specific agents that can call functions.
In the example below, the orchestrator agent has the ability to generate function calls itself; however, depending on the prompt’s complexity, you could have the orchestrator detect that the user is interested in weather and hand everything related to this off to a weather sub-agent. In this case, the orchestrator is able to generate the check_weather()
function call since it's been made aware of this function as part of the system prompt. The orchestrator receives some weather data in JSON format, but instead of handing it back to the application, it hands this data to another LLM in the network, asking it to send back a natural language response along with a visual chart of hourly temperatures. Once this LLM has responded with natural language and an image, the response is returned to the application.

In an agent network, the function call is just one part of a larger puzzle. In this example, the orchestrator agent is responsible for generating the function call instructions, one of the agents is responsible for actually calling the function, and another agent creates user-friendly text based on the response.
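As a rough sketch of that flow (not the API of any particular framework), the three roles might fit together like this. Every function here is a hypothetical stub standing in for an LLM call or a real weather lookup:

# Hypothetical stubs: in a real agent network, orchestrate() and summarize()
# would each be LLM calls, and call_function() would hit a real weather API.
def orchestrate(message: str) -> dict:
    # The orchestrator agent picks a function and generates its arguments.
    return {"name": "check_weather", "parameters": {"location": "New York City", "time_period": "hourly"}}

def call_function(tool_call: dict) -> dict:
    # A task-specific agent (or your application code) executes the call.
    return {"location": tool_call["parameters"]["location"], "hourly_temps_f": [41, 43, 44]}

def summarize(message: str, data: dict) -> str:
    # Another LLM turns the raw JSON into a natural language reply (and a chart).
    return f"Over the next few hours in {data['location']}, expect temperatures around {data['hourly_temps_f'][0]}°F."

def handle_user_message(message: str) -> str:
    tool_call = orchestrate(message)
    weather_data = call_function(tool_call)
    return summarize(message, weather_data)

print(handle_user_message("What's the weather in New York City?"))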
What are the problems with function calling?
The key problem when it comes to LLMs generating function-calling instructions is that you need to be sure that they're generating the correct function names. The more functions an LLM has to choose from, the harder this process becomes.
An LLM uses any context it has available to it to choose the correct function name. This includes user instructions, conversation history, the system prompt, and any instructions from the orchestrator or other agents in the network. Even if the agent generates the correct function name, it can still forget to pass some parameters to it or pass the wrong values or data types for each parameter.
Given all these things that can go wrong, you can't just assume that an LLM will generate function calls correctly every time. That's why you need function-calling evaluation.
What is function-calling evaluation?
Function-calling evaluation, as its name suggests, refers to evaluating everything around how good an LLM (or agent network) is at generating the correct function calls. This includes the function names, whether all parameters are present, and whether the parameters are the correct data types.
If you're dealing with a single LLM, you can analyze the JSON that it returns to check whether it's even generating a syntactically correct function call at all (with a function call name, parameters and data types), and then if it is, whether this includes the correct function name and parameters.
However, if you're using a network of agents, you'll need to analyze two different things:
The JSON response from the agent that generated the function call. This response is usually sent to another agent in the network — one that's responsible for planning or orchestration.
The overall natural language response from the agent network to the user (or your application).
Evaluating an LLM involves analyzing whether its responses show that it's doing the things you want it to do. This generally involves setting up some test cases of sample inputs and the responses they generate and comparing each response with a predetermined "gold standard" expected response. You might have a number of different behaviors that you care about, and each of these needs to be checked.
When evaluating function calls, you'll want to ask these questions as a bare minimum:
Is the system returning a syntactically correct function call at all (with a function call name, parameters, and data types)?
Does the function name match?
Do the parameter names, values, and data types match?
You can make your evaluations even more comprehensive by checking things like how your system handles edge cases such as misspellings or whether it can handle multiple or parallel function calls effectively.
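To make the bare-minimum questions above concrete, here is a minimal, framework-free sketch of what such checks could look like; Okareo's built-in checks (covered later) handle this and more, so treat this purely as an illustration.

import json

def check_function_call(generated: str, expected: dict) -> dict:
    # Compare a generated function call (a JSON string) against a gold-standard dict.
    results = {"valid_call": False, "name_matches": False, "params_match": False}

    # 1. Is it a syntactically correct function call at all?
    try:
        call = json.loads(generated)
    except json.JSONDecodeError:
        return results
    results["valid_call"] = isinstance(call.get("name"), str) and isinstance(call.get("parameters"), dict)

    # 2. Does the function name match?
    results["name_matches"] = call.get("name") == expected["name"]

    # 3. Do the parameter names, values, and data types match?
    results["params_match"] = call.get("parameters") == expected["parameters"]
    return results

print(check_function_call(
    '{"name": "check_weather", "parameters": {"location": "New York City", "time_period": "current"}}',
    {"name": "check_weather", "parameters": {"location": "New York City", "time_period": "current"}},
))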
Why is function-calling eval important?
Function-calling evaluation is the best way to ensure that your system is doing a good job. It helps you determine whether the LLM (or agent network) is selecting the correct functions and passing valid parameters. Running regular evals also helps avoid regressions, where your system starts to perform worse after the model or system prompts are updated. Evaluations also allow you to be confident that your system understands the user's intent and that it can make use of context such as conversation history to make decisions.
If an eval fails, this highlights potential areas for improvement. For example, if the system isn't calling the correct function, this could be due to a number of issues in different parts of the system. A common one is that the function schema isn't clear enough. A function-calling LLM is given a schema of all the available functions, including their names, descriptions of what they do, and the parameters they take.
A function-calling evaluation will tell you how good of a job your LLM is doing, and if it's doing a poor job, you can see specific use cases where it fails. For example, you could see which function is being called instead of the correct one. Reading the descriptions of each function may give you insight into why the wrong function was called. You can then update the descriptions of each function so they're less likely to be confused.
How to use Okareo for function-calling eval
Okareo is a tool for evaluating LLMs, and you can use it to evaluate LLMs that generate function calls as well. It consists of an app plus Python and TypeScript SDKs. Here, we'll show how to evaluate function calls using an example with account creation and deletion functions for a website. The full working code for this example is hosted on our GitHub.
Start by defining the test cases for your evaluation, which are called scenarios in Okareo. These are a series of example user inputs paired with a "gold standard" result that you define. In this example, we're focusing on evaluating the LLM's ability to generate function calls, so you need to create some scenarios with function calls inside the result fields. A function call is typically JSON containing the function name and its parameters. Make sure you use the exact same structure that your system expects when it calls the functions.
# SeedData and ScenarioSetCreate come from the Okareo Python SDK; see the
# full example on GitHub for the imports and client setup.
scenario_data = [
    SeedData(
        input_="can you delete my account? my name is Bob",
        result={"function": {"name": "delete_account", "arguments": {"username": "Bob"}, "__required": ["username"]}},
    ),
    SeedData(
        input_="can you delete my account? my name is john",
        result={"function": {"name": "delete_account", "arguments": {"username": "John"}, "__required": ["username"]}},
    ),
    SeedData(
        input_="how do I make an account? my name is Alice",
        result={"function": {"name": "create_account", "arguments": {"username": "Alice"}, "__required": ["username"]}},
    ),
    SeedData(
        input_="how do I create an account?",
        result={"function": {"name": "create_account", "arguments": {"username": ".+"}, "__required": ["username"]}},
    ),
    SeedData(
        input_="my name is steve. how do I create a project?",
        result={"function": {"name": "create_account", "arguments": {"username": "Steve"}, "__required": ["username"]}},
    ),
]

tool_scenario = okareo.create_scenario_set(
    ScenarioSetCreate(
        name=f"Function Call Demo Scenario - {random_string(5)}",
        seed_data=scenario_data,
    )
)
Once you've created your scenario set, you need to register your model with Okareo. Okareo works with a variety of third-party models, such as those from OpenAI, and these are simple to register since you typically just need the model's ID. For this example, however, we're going to use a custom "model" (actually a fully deterministic script) so that we have full control over its output, which makes the evaluation process easier to follow.
import re

# create the custom model
class FunctionCallModel(CustomModel):
    def __init__(self, name):
        super().__init__(name)
        self.pattern = r'my name is (\S+)'

    def invoke(self, input_value):
        out = {"tool_calls": []}
        function_call = {"name": "unknown"}

        # parse out the function name
        if "delete" in input_value:
            function_call["name"] = "delete_account"
        if "create" in input_value:
            function_call["name"] = "create_account"

        # parse out the function argument
        match = re.search(self.pattern, input_value)
        if match:
            username = match.group(1)
            function_call["arguments"] = {"username": username}

        tool_call = {
            "function": function_call
        }

        # package the tool call and return
        out["tool_calls"].append(tool_call)
        return ModelInvocation(
            model_input=input_value,
            tool_calls=out["tool_calls"]
        )

# Register the model
model_under_test = okareo.register_model(
    name="Fake model that simulates function calling",
    model=[FunctionCallModel(name=FunctionCallModel.__name__)],
    update=True
)
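To sanity-check the toy model, you can call its invoke method directly. Assuming ModelInvocation exposes the tool_calls it was constructed with (check your SDK version), the first scenario input produces a tool call that mirrors the gold-standard result:

# Assumes ModelInvocation keeps the tool_calls passed to its constructor.
invocation = FunctionCallModel(name="demo").invoke("can you delete my account? my name is Bob")
print(invocation.tool_calls)
# [{'function': {'name': 'delete_account', 'arguments': {'username': 'Bob'}}}]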
Finally, you can run an evaluation on your model, passing in a series of named checks. Okareo has created a number of out-of-the-box checks that you can use by passing in their names, including some that are specifically for evaluating function calling, but you can also create your own custom checks. The checks listed below are built into Okareo.
evaluation = model_under_test.run_test(
    name="Function call evaluation",
    scenario=tool_scenario.scenario_id,
    test_run_type=TestRunType.NL_GENERATION,
    checks=[
        "function_call_validator",
        "is_function_correct",
        "are_all_params_expected",
        "are_required_params_present",
        "do_param_values_match",
    ],
)
Below is an explanation of what each function-call check does:
function_call_validator: checks that the model returned a syntactically valid function call at all.
is_function_correct: checks that the generated function name matches the expected one.
are_all_params_expected: checks that every parameter the model passed is one the expected function call actually defines.
are_required_params_present: checks that all required parameters are present in the generated call.
do_param_values_match: checks that the generated parameter values match the expected values.
To run your Okareo code, you'll need to have Okareo installed locally and run the okareo run command. For more information on this, follow the instructions in the README.
Once you've run an Okareo evaluation, you can view the results inside the app in the Evaluations tab, or you can have your code print out a direct link to your evaluation's results page like so:
print(f"See results in Okareo: {evaluation.app_link}")
When viewing the results page, you can see an overview of how well your LLM (or, in this case, our pretend model script) has performed on each check, as well as specific results for each individual scenario. You don't have to view these results visually; you can also choose to read them programmatically. For example, you can set up automations to run regular evaluations as part of your CI workflow and inform you of any regressions.
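For example, a CI job could fail the build when a check's score drops below a threshold. The exact shape of the returned metrics depends on your SDK version, so the model_metrics access below is an assumption to adapt rather than copy:

# Hypothetical CI gate: adjust the attribute and key names to whatever your
# version of the Okareo SDK actually returns for an evaluation run.
MIN_PASS_RATE = 0.9

metrics = evaluation.model_metrics.to_dict()  # assumed accessor
pass_rate = metrics["mean_scores"]["is_function_correct"]  # assumed structure

if pass_rate < MIN_PASS_RATE:
    raise SystemExit(
        f"Function-calling regression: is_function_correct scored {pass_rate:.2f}, "
        f"below the {MIN_PASS_RATE} threshold. See {evaluation.app_link}"
    )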

Clicking on each individual row allows you to view the input, expected result, and tools invoked, which can help you debug why there was a failure.

In the screen capture above, you can see that the second row failed because no name was given in the input. You can use this information to make your system more robust. If you don't want the function call to fail in this scenario, you could make the username parameter optional and have the function supply a default value when none is given, such as "New User123." The user can then be required to provide a real name when they first log into their account.
Our example scenario set contains a mixture of scenarios, some of which fail different checks, allowing you to see some failures in action and understand how they help you improve your system. For example, you can see that rows 1 and 4 have both failed the do_param_values_match check (because "john" is lowercase, and "steve." is lowercase and ends with a period). Seeing this, you might choose to normalize names, for example by converting them to lowercase and stripping trailing periods and whitespace.
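A small normalization helper (illustrative only) is one way to do that before the value is compared or stored:

def normalize_username(raw: str) -> str:
    # Lowercase, strip surrounding whitespace, and drop trailing punctuation
    # such as the period picked up from "my name is steve."
    return raw.strip().rstrip(".").lower()

print(normalize_username("Steve."))  # -> "steve"
print(normalize_username(" John "))  # -> "john"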
You can also see that row 3 has failed the is_function_correct check because the user said "make an account" instead of "create an account." In this case, updating the function's description in the schema or adding more information to the system prompt might help the LLM understand that these phrases are equivalent.
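For instance (purely illustrative, and not the schema from the example repo), the create_account entry could spell out the synonyms explicitly:

create_account_tool = {
    "name": "create_account",
    "description": (
        "Creates a new user account. Use this when the user asks to create, "
        "make, open, register, or sign up for an account."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "username": {
                "type": "string",
                "description": "The name to register the account under."
            }
        },
        "required": ["username"]
    }
}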
Use Okareo's function-calling eval alongside evaluation of all your LLM's behaviors
Function-calling evaluation is just one aspect of the evaluations you need to do for your LLMs. You can use Okareo to evaluate the output of any LLM, whether that's natural language, JSON, or code. It can also evaluate multi-turn conversations between a user and an agent and between agents within an agent network.
You can try out many of these features by signing up for Okareo and installing it today.