Webinar: Evaluating Agentic Function Calls in Production


Mason del Rosario, PhD, Founding Staff Machine Learning Engineer

April 1, 2025

Key Topics
  • Function call definition/examples

    • Function call definition: A mechanism that makes an LLM “agentic” by letting it interact with an external system

    • Function call examples (see Sketch 1 after this outline)

      • Making an API call to an app

      • Generating/executing code

      • Calling another agent 

  • Reference-based evaluations

    • Reference definition: “ground truth” or “label” to compare a generated function call against

    • Berkeley Function Calling Leaderboard (BFCL): A benchmark published by UC Berkeley for comparing agents’ function-calling capabilities

    • BFCL Demo: In Okareo, we use the BFCL team’s reference-based metric, AST Check, to statically evaluate generated function calls (Sketch 2 after this outline illustrates the idea)

    • Problem: In production, we do not have reference function calls available.

  • Reference-free evaluations: LLM as a judge (Sketch 3 after this outline shows the pattern)

    • Show the judge the user query and the generated function call

    • Describe pass/fail criteria

    • Give a few examples of good vs. bad function calls

    • Get a pass/fail result

  • Demo: Debugging a weather agent in Okareo 

    • Example application: Agent with a “get weather” function call instrumented with Okareo

    • Okareo tracing concepts (Sketch 4 after this outline shows the idea in miniature):

      • Monitor: A set of filter criteria that organizes completions and associates checks with them

      • Check: An LLM-as-a-judge evaluation applied automatically to the completions a monitor captures

    • Demo outcome: Okareo’s checks help uncover erroneous function calls and confirm that improvements to the agent resolve those errors
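
Sketch 1 (referenced in the outline above): a minimal illustration of a function schema and the structured call an LLM might generate against it. This is a generic, provider-agnostic sketch; the get_weather tool and its fields are hypothetical, and real wire formats vary by LLM provider.

    # Sketch 1: a tool (function) schema and the structured call an LLM might emit.
    # Generic illustration only; exact formats vary by provider.

    # A function the agent is allowed to call -- a hypothetical weather lookup.
    get_weather_tool = {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    }

    # The call an LLM might generate for the query "What's the weather in Oslo?"
    generated_call = {
        "name": "get_weather",
        "arguments": {"city": "Oslo", "units": "celsius"},
    }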
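Sketch 2: a simplified reference-based check in the spirit of BFCL’s AST Check, where the generated call’s function name and arguments are compared against a reference without executing anything. This is not the BFCL implementation, only the core idea; the reference format (lists of acceptable values per parameter) is an assumption for illustration.

    # Sketch 2: a simplified reference-based check -- compare structure, don't execute.
    # Not the BFCL implementation; the reference format here is illustrative.

    def reference_check(generated: dict, reference: dict) -> bool:
        """Pass if the call targets the expected function with acceptable argument values."""
        if generated["name"] != reference["name"]:
            return False
        for param, expected in reference["arguments"].items():
            # The reference may list several acceptable values for a parameter.
            allowed = expected if isinstance(expected, list) else [expected]
            if generated["arguments"].get(param) not in allowed:
                return False
        return True

    generated_call = {"name": "get_weather", "arguments": {"city": "Oslo", "units": "celsius"}}
    reference_call = {"name": "get_weather", "arguments": {"city": ["Oslo"], "units": ["celsius", "fahrenheit"]}}
    print(reference_check(generated_call, reference_call))  # True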
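Sketch 3: the reference-free, LLM-as-a-judge pattern from the outline: show the judge the user query and the generated call, state the pass/fail criteria, give a few examples, and ask for a pass/fail verdict. The call_judge_model parameter is a placeholder for whatever LLM client you use; no specific provider API is assumed.

    # Sketch 3: a reference-free LLM-as-a-judge check.
    # `call_judge_model` is a placeholder for your LLM client; no specific API is assumed.
    import json

    JUDGE_PROMPT = """You are evaluating an agent's function call.

    Pass criteria:
    - The chosen function matches the user's intent.
    - Required arguments are present and consistent with the query.

    Examples:
    - Query: "Weather in Paris?"  Call: get_weather(city="Paris")  -> PASS
    - Query: "Weather in Paris?"  Call: get_weather(city="London") -> FAIL

    User query: {query}
    Generated function call: {call}

    Answer with exactly PASS or FAIL."""

    def judge_function_call(query: str, call: dict, call_judge_model) -> bool:
        prompt = JUDGE_PROMPT.format(query=query, call=json.dumps(call))
        verdict = call_judge_model(prompt)  # the judge model's text response
        return verdict.strip().upper().startswith("PASS")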
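Sketch 4: how a monitor (filter criteria) and a check (an automatically applied judge) fit together conceptually. This is a purely illustrative pseudo-setup, not Okareo’s actual SDK or API.

    # Sketch 4: conceptual wiring of a monitor (filter) and a check (auto-applied judge).
    # Purely illustrative; this is not Okareo's SDK.

    completions = [
        {"query": "Weather in Oslo?", "call": {"name": "get_weather", "arguments": {"city": "Oslo"}}},
        {"query": "Weather in Oslo?", "call": {"name": "get_weather", "arguments": {"city": "Ohio"}}},
    ]

    def in_monitor(completion: dict) -> bool:
        # Monitor: filter criteria -- here, only completions that issued a get_weather call.
        return completion["call"]["name"] == "get_weather"

    def run_check(completions: list, judge) -> list:
        # Check: apply a judge (e.g., Sketch 3) automatically to every in-monitor completion.
        return [judge(c["query"], c["call"]) for c in completions if in_monitor(c)]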

Key Takeaways

Function calls turn LLMs into true agents, and conventional reference-based evaluations can help ensure that agents meet performance benchmarks. However, such evaluations do not translate to a production setting, where references are unavailable. To overcome the need for references, using an LLM to judge the quality of an agent’s function calls is a compelling approach. We demonstrate the viability of LLM judges by applying Okareo’s Monitors and Checks to an example weather agent, and we show how these tools help us identify and resolve issues with the agent’s function calls. Finally, we conclude with some thoughts on balancing reference-based and reference-free evaluations, and on how the outputs of each can improve the other.
