Emerging Approaches for Agent Evaluations

Agentics

Boris Selitser, Co-founder of Okareo

May 4, 2024

If you are a developer building an LLM application, 2023 was the year of RAG (Retrieval Augmented Generation). 2024 seems to be the year of agents, with everything being labeled an 'agent' now. If by now you haven't heard of RAG, then we are definitely moving in different circles.

With that in mind, I wanted to put down a few thoughts on agents and agent evaluations. LLM evaluations are still a hard problem, and they are even harder for complex agent-based systems. In the larger developer community, consensus is building that evaluations are essential to successful LLM application deployments. More mature teams have already built evaluations into their development process, and the wider community is moving towards standardizing the practice, including as part of CI/CD. In parallel, as typically happens, agent evaluations raise the complexity by an order of magnitude.

What is an Agent anyway?

Believe me, I'd prefer not to spend time on this question, but unfortunately, I think it's necessary to be 'grounded' on it before going further. I only mention this because sometimes I hear the term used to refer to a system prompt with some instructions. So here are a few traits that I think set an agent-based LLM system apart:

  • Autonomy in interpreting, breaking down, and achieving high-level goals, with limited to no human guidance

  • Independent ability to complete actions in the real world - spin up a Kubernetes cluster, schedule a meeting, book airline tickets

  • Facility for planning, invoking tools, and observing environment changes that result from an action

Useful Abstractions for LLM Applications

As a developer, it strikes me how many parallels there are between programming abstractions and what is emerging as building patterns for LLM applications. LLM invocations are used to replace whole modules of an application and are composed together in a rapid development paradigm that is changing how software is built. Not to mention generating the code to glue it all together. One can think of a foundational LLM as part of your application stack, similar to the MEAN (MongoDB, Express.js, Angular, Node.js) or LAMP (Linux, Apache, MySQL, PHP) stacks. In this context, the agent pattern is a programming abstraction on top of the LLM stack. Instead of trying to stuff all your application logic into a single, unwieldy prompt, you can route to agents with different responsibilities that can also coordinate on more complex tasks.
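
To make the "route to agents with different responsibilities" idea concrete, here is a minimal sketch in Python. The call_llm helper and the specific agent roles are hypothetical placeholders, not any particular framework's API:

```python
# Hypothetical sketch: routing a task to agents with different responsibilities
# instead of one giant prompt. Replace call_llm with your own model client.
from typing import Callable, Dict

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def research_agent(task: str) -> str:
    return call_llm(f"You are a research agent. Gather the facts needed for: {task}")

def writer_agent(task: str) -> str:
    return call_llm(f"You are a writing agent. Draft the final response for: {task}")

AGENTS: Dict[str, Callable[[str], str]] = {
    "research": research_agent,
    "write": writer_agent,
}

def route(task: str) -> str:
    # Ask the model which responsibility fits, then dispatch to that agent.
    choice = call_llm(
        f"Classify this task as one of {list(AGENTS)}: {task}\nAnswer with one word."
    ).strip().lower()
    return AGENTS.get(choice, writer_agent)(task)  # fall back to a default agent
```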

For developers this is not new: putting all your logic into a single prompt doesn't make sense for many reasons, regardless of the context size. We are starting to see many such design patterns emerge on top of the LLM stack. I like how Andrew Ng breaks down the patterns seen so far.

I see our customers already adopting variations of these patterns in production. Developers model agent structures as company hierarchies (worker and manager agents) or team roles (designer, developer, analyst agents), which parallels typical software abstractions (director/manager/builder modules or classes in programming). Other software abstractions, like mechanical systems (engine, filter, safety system), could work just as well. The complex pattern of multi-agent collaboration isn't exactly new either. For example, message-passing agents or processes are a well-established programming abstraction in Erlang and other systems.

Now if we consider each agent pattern and the things that could go wrong, it can get messy very quickly… Applying a software development paradigm to building with agents means you need to be able to fix your behavior baselines and have a solid foundation to iterate on. Evaluations are the new unit/functional/end-to-end tests. So how do we evaluate these complex agent patterns?
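
Before getting to that question, here is what the 'evaluations are tests' mindset can look like mechanically - a hedged sketch in which the scenarios, the run_agent hook, and the 0.8 baseline are all placeholder assumptions:

```python
# Hypothetical sketch: an agent evaluation wired up like a unit test with a
# fixed behavior baseline, so a regression fails the CI build.
SCENARIOS = [
    {"input": "Book a one-way flight SFO to JFK next Friday", "expect": "flight_booked"},
    {"input": "Cancel my reservation ABC123", "expect": "reservation_cancelled"},
]

def run_agent(user_input: str) -> str:
    raise NotImplementedError("call your agent system and return its final outcome label")

def test_agent_baseline():
    passed = sum(1 for s in SCENARIOS if run_agent(s["input"]) == s["expect"])
    pass_rate = passed / len(SCENARIOS)
    assert pass_rate >= 0.8, f"pass rate {pass_rate:.0%} fell below the agreed baseline"
```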

Approaches for Agent Evaluations

Many automatic evaluation frameworks (no human evaluators) to date have focused on single-turn generation, scoring model output based on a given LLM prompt and a reference result. With this approach it is easy to trap entry and exit conditions, and it is substantially easier to reason about what led to an erroneous generation. While effective for specific tasks like summarization and sentiment analysis, it is insufficient for evaluating multi-turn agents that use tools, plan independently, and trigger actions with real-world implications. Here, we discuss emerging approaches for the latter.

  • RAG-based Evals - RAG is an LLM application building block, an architecture that fits just as well when building LLM agents. Often teams start with a predetermined RAG flow and then evolve it into more of an agent-based control loop as the application gets more complex. With this in mind, some evaluation frameworks take the incremental approach of evaluating the agent as a RAG system. This works, for example, when the agent is focused primarily on information retrieval tasks but has autonomy to select the right real-time data tools - flight status, IoT sensors, and such. Treating it as a RAG problem allows evaluating each stage along widely used RAG dimensions: context relevance, answer consistency with context, and answer relevance. The wrinkle is model tool selection, which can be treated as a classification problem mapping user intent to the right tool (a rough scoring sketch follows after this list).

  • LLM Based Dialog Evaluation - Dialog system evaluations have been pursued in academia since well before the recent explosion of LLMs. These evaluations are usually performed with the multi-turn dialog history as context and score the quality of the last generated dialog turn. A recent improvement is using an LLM-as-a-Judge evaluator to produce 'unified' scores. The primary focus has been on the quality of next-turn generation in conversational or chatbot contexts, rather than on autonomous systems that perform planning, task breakdown, tool selection, and tool output interpretation. Consequently, the metrics are often linguistic and broad in nature, which is useful in academic settings.

  • LLM as Simulated User - There are several variations of this implementation, but the main idea is to have an LLM role-play a user, interacting with the agent system under evaluation. The user definition can be as simple as a prompt with instructions on the user's role and a high-level goal to achieve, such as requesting a complex itinerary booking. More advanced implementations could include a dialog scenario generator that uses past dialogs for requested task variety, LLMs to generate user personalities and dialog details, and a simulation agent to drive conversations with these simulated users. Notably, most commercial evaluation frameworks simulate only the user side of the agent interaction and not the environment side (e.g. an operating system shell). A bare-bones version of this loop is sketched after this list.

  • Adversarial Models - This is the 'most emerging' of all the emerging approaches on this list. It borrows from model-based red-teaming techniques but applies more broadly than model security and safety. The idea is to sample potential test scenarios from an adversarial LLM and run them through the agent being evaluated. The adversarial model could then be further fine-tuned on failed scenarios. This could apply to any expected agent behavior, not just safety. The approach typically combines a search function with adversarial model sampling and maps successful and failed outcomes to particular task categories.
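
To ground the RAG-based approach above, here is a rough sketch of scoring a single turn along the standard RAG dimensions and checking tool selection as a classification problem. The judge_llm hook and the 1-5 scale are assumptions, not a specific framework's API:

```python
# Hypothetical sketch: scoring one agent turn along common RAG dimensions and
# treating tool selection as intent classification. judge_llm is a placeholder.
def judge_llm(prompt: str) -> float:
    raise NotImplementedError("return a 1-5 score from your judge model")

def score_rag_turn(question: str, context: str, answer: str) -> dict:
    return {
        "context_relevance": judge_llm(
            f"Rate 1-5 how relevant this context is to the question.\n"
            f"Question: {question}\nContext: {context}"),
        "answer_consistency": judge_llm(
            f"Rate 1-5 how well the answer is supported by the context.\n"
            f"Context: {context}\nAnswer: {answer}"),
        "answer_relevance": judge_llm(
            f"Rate 1-5 how well the answer addresses the question.\n"
            f"Question: {question}\nAnswer: {answer}"),
    }

def tool_selection_accuracy(examples: list, select_tool) -> float:
    # Each example pairs a user intent with the tool we expect the agent to pick.
    correct = sum(1 for ex in examples if select_tool(ex["user_intent"]) == ex["expected_tool"])
    return correct / len(examples)
```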

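And here is a bare-bones version of the 'LLM as Simulated User' loop, with call_llm and agent_reply as hypothetical hooks into your own stack and a made-up traveler persona:

```python
# Hypothetical sketch: an LLM role-playing a user to drive the agent under
# evaluation. call_llm and agent_reply are stand-ins for your own stack.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in the simulated-user model here")

def agent_reply(history: list) -> str:
    raise NotImplementedError("call the agent system under evaluation")

USER_PERSONA = (
    "You are role-playing a traveler booking a multi-city itinerary. "
    "Pursue that goal turn by turn and say DONE once the agent has completed the booking."
)

def simulate_dialog(max_turns: int = 10) -> list:
    history = []
    for _ in range(max_turns):
        transcript = "\n".join(f"{m['role']}: {m['text']}" for m in history)
        user_msg = call_llm(
            f"{USER_PERSONA}\n\nConversation so far:\n{transcript}\n\nYour next message:")
        history.append({"role": "user", "text": user_msg})
        if "DONE" in user_msg:
            break
        history.append({"role": "agent", "text": agent_reply(history)})
    return history  # hand the transcript to whatever scoring method you use
```
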
Metrics

Looking at the above approaches, the big question is what metrics could be used to score the evaluation results. Indeed, this is the critical part that makes these approaches useful at all. This is also where things get a bit more complicated… The metrics being used fall into these categories:

  • RAG / Retrieval Metrics - As mentioned, for retrieval-oriented agents that are primarily used to find and synthesize requested information, these could be enough. Obviously, they are limiting for many other agent use cases.

  • Dialog Evaluation Metrics - In the context of building LLM applications, these could be used as guardrail metrics, a proxy for overall system health. Other than that, they are not directly actionable with respect to agent evaluation on particular tasks. For example, getting a Coherence score of 3.5 does not tell you whether the agent accomplished a task; it only speaks to the quality of the dialog being generated.

  • Agent Trajectory Decomposition - Agent trajectory, meaning the steps an agent takes and the tools it uses, is another aspect leveraged in agent evaluations. This is more of a white-box scoring method and relies on evaluation sets with predefined tasks and success criteria. In some cases this is taken even further with 'progress rate' metrics towards completion, versus a simple pass/fail for a given task. Progress rate metrics rely primarily on a subjective breakdown of tasks into subtasks that can be measured (see the sketch after this list).

  • Use Case Specific Metrics - On the other end of the spectrum are metrics that capture the overall LLM application goal or use case. For example, this could be Issue Resolution Rate for a customer service agent. Being able to identify and measure metrics like these is ideal, as they align with the key application outcome. More detailed metrics could ladder up to this top-level outcome metric. In practice, constructing metrics like this is difficult. In some cases an LLM-as-a-Judge method could be used to approximate it.
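
For the trajectory decomposition idea, a simple sketch of a progress-rate check against a predefined expected trajectory could look like this (the step names are hypothetical):

```python
# Hypothetical sketch: score progress against a predefined expected trajectory
# rather than a bare pass/fail. Step names are made up for illustration.
EXPECTED_TRAJECTORY = ["search_flights", "check_availability", "book_ticket", "send_confirmation"]

def progress_rate(observed_steps: list, expected: list = EXPECTED_TRAJECTORY) -> float:
    # Count expected subtasks completed in order, allowing extra steps in between.
    done = 0
    for step in observed_steps:
        if done < len(expected) and step == expected[done]:
            done += 1
    return done / len(expected)

# e.g. progress_rate(["search_flights", "retry_search", "check_availability"]) -> 0.5
```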

Metric Issues to Consider:

  • For an average one-to-two-pizza development team, manually creating enough evaluation data sets for each agent, including success criteria by task, can simply be too much extra effort.

  • Having to define agent task and trajectory success criteria ahead of time assumes that the task is known in advance and that there is a single successful path to the outcome. This excludes evaluation on production scenarios not seen before. It also ignores that there can be multiple successful paths to the same outcome. For instance, if the task is to add up account values stored in a given file, it could be done equally well via a Python script or a shell command. The output could also be captured in different formats.

  • For any metrics chosen, what is often overlooked is measuring the cost (API-based model costs, or compute costs for self-hosted models) and latency of getting to actual task completion. Agent success measures are not useful outside the context of cost and latency, and in real-world LLM applications with multi-turn generations these two can often be prohibitive. The sketch below shows one way to record them alongside task success.
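
A minimal sketch of folding cost and latency into the same result, where the per-token price and the shape of run_task's return value are assumptions:

```python
# Hypothetical sketch: report cost and latency alongside task success.
# The blended token price and run_task's return shape are assumptions.
import time

PRICE_PER_1K_TOKENS = 0.01  # adjust to your model and provider

def evaluate_with_cost(run_task, task: str) -> dict:
    start = time.perf_counter()
    result = run_task(task)  # expected to return {"success": bool, "tokens_used": int}
    latency_s = time.perf_counter() - start
    return {
        "success": result["success"],
        "latency_s": round(latency_s, 2),
        "cost_usd": round(result["tokens_used"] / 1000 * PRICE_PER_1K_TOKENS, 4),
    }
```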


In the next revision, I plan to add more references for those interested in going deeper.
