Webinar: Evaluating Agentic Function Calls in Production

Mason del Rosario, PhD, Founding Staff Machine Learning Engineer
April 1, 2025
Key Topics
Function call definition/examples
Function call definition: A mechanism that makes an LLM “agentic” by letting it interact with an external system
Function call examples (a short code sketch follows this list)
Making an API call to an app
Generating/executing code
Calling another agent
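To make the definition concrete, here is a minimal sketch of a function call using the OpenAI chat completions API. The get_weather tool and its parameters are hypothetical examples, not part of the webinar's code; the point is that the model returns a structured call for the application to execute rather than a natural-language answer.

```python
# Minimal sketch: exposing a hypothetical get_weather tool to an LLM and
# inspecting the function call it generates (OpenAI chat completions API).
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool name
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Paris'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in Paris right now?"}],
    tools=tools,
)

# The model does not answer directly; it emits a structured function call
# that our application is responsible for executing.
call = response.choices[0].message.tool_calls[0]
print(call.function.name)                   # "get_weather"
print(json.loads(call.function.arguments))  # {"city": "Paris", ...}
```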
Reference-based evaluations
Reference definition: “ground truth” or “label” to compare a generated function call against
Berkeley Function Call Leaderboard (BFCL): A Berkeley-published benchmark used to compare agents’ function call capabilities
BFCL Demo: In Okareo, we use the BFCL team’s reference-based metric, AST Check, to statically evaluate generated function calls (a simplified sketch appears after this section)
Problem: In production, we do not have reference function calls available.
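Before turning to reference-free checks, the sketch below illustrates the spirit of a reference-based check such as BFCL's AST Check: a generated call passes only if the function name matches the reference and every argument takes an accepted value. This is a deliberately simplified illustration (it treats every reference parameter as required), not the official BFCL implementation.

```python
# Simplified, illustrative reference-based check for a generated function call.
from typing import Any

def reference_check(generated: dict[str, Any], reference: dict[str, Any]) -> bool:
    """Pass only if the function name matches and every reference parameter
    appears with an accepted value."""
    if generated["name"] != reference["name"]:
        return False
    gen_args = generated["args"]   # param -> generated value
    ref_args = reference["args"]   # param -> list of accepted values
    # Reject parameters the reference does not know about.
    if any(param not in ref_args for param in gen_args):
        return False
    # Every reference parameter must be present with an accepted value.
    return all(
        param in gen_args and gen_args[param] in accepted
        for param, accepted in ref_args.items()
    )

generated = {"name": "get_weather", "args": {"city": "Paris", "unit": "celsius"}}
reference = {"name": "get_weather",
             "args": {"city": ["Paris"], "unit": ["celsius", "fahrenheit"]}}
print(reference_check(generated, reference))  # True
```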
Reference-free evaluations: LLM as a judge (a prompt sketch follows this list)
Show the judge the user query and the generated function call
Describe pass/fail criteria
Give a few examples of good vs. bad function calls
Get a pass/fail result
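The steps above can be wired together in a few lines. The sketch below shows one possible judge prompt and harness using the OpenAI API; the rubric wording, the few-shot examples, and the model choice are illustrative assumptions rather than a prescribed recipe.

```python
# Reference-free LLM-as-judge sketch: the judge sees only the user query and
# the generated function call (no ground truth), plus pass/fail criteria and
# a couple of good/bad examples, and returns PASS or FAIL.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating whether a function call correctly serves a user query.
PASS if the function name is appropriate and the arguments are faithful to the query.
FAIL if the wrong function is called, arguments are missing, or values are hallucinated.

Example (PASS):
  Query: "Weather in Tokyo?"  Call: get_weather(city="Tokyo")
Example (FAIL):
  Query: "Weather in Tokyo?"  Call: get_weather(city="Kyoto")

Answer with a single word: PASS or FAIL.

Query: {query}
Call: {call}
"""

def judge_function_call(query: str, call: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(query=query, call=call)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")

print(judge_function_call("What's the weather in Paris?",
                          'get_weather(city="Paris")'))
```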
Demo: Debugging a weather agent in Okareo
Example application: Agent with a “get weather” function call instrumented with Okareo
Okareo tracing concepts (sketched in code after this section):
Monitor: A set of filter criteria that groups related completions and associates checks with them
Check: An LLM-as-judge evaluation that is applied automatically to the completions captured by a monitor
Demo outcome: Okareo’s checks help uncover erroneous function calls and validate that improvements to the agent resolve the function call errors
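The sketch below captures the monitor/check pattern from the demo in plain Python with hypothetical names; it is not the Okareo SDK. A monitor is a filter over logged completions, a check runs automatically on whatever the monitor captures, and in practice the check could wrap the LLM judge sketched earlier.

```python
# Conceptual monitor/check sketch (hypothetical names, not the Okareo SDK).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Completion:
    query: str
    tool_name: str
    tool_args: dict

def weather_monitor(c: Completion) -> bool:
    """Filter criterion: only completions that called the weather tool."""
    return c.tool_name == "get_weather"

def city_present_check(c: Completion) -> bool:
    """Check: the call must include a non-empty city argument."""
    return bool(c.tool_args.get("city"))

def run_checks(traffic: list[Completion],
               monitor: Callable[[Completion], bool],
               check: Callable[[Completion], bool]) -> list[Completion]:
    """Return the in-monitor completions that fail the check."""
    return [c for c in traffic if monitor(c) and not check(c)]

traffic = [
    Completion("Weather in Paris?", "get_weather", {"city": "Paris"}),
    Completion("Weather tomorrow?", "get_weather", {"city": ""}),  # erroneous call
]
print(run_checks(traffic, weather_monitor, city_present_check))
```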
Key Takeaways
Function calls turn LLMs into true agents, and conventional reference-based evaluations can help ensure that agents meet performance benchmarks. However, such evaluations fail to translate to a production setting where references are unavailable. To overcome the need for references, using an LLM to judge the quality of an agent’s function calls is a compelling approach. We demonstrate the viability of LLM judges by applying Okareo’s Monitors and Checks to an example weather agent, and we show how these tools can help us identify and resolve issues with our agent’s function calls. Finally, we conclude with some thoughts on balancing reference-based vs. reference-free evaluations, and how the outputs of each activity can improve the other.
