Evaluation and Checks

Matt Wyman, Co-founder of Okareo

May 18, 2024

Evaluating the output of our LLM components makes intuitive sense. But what to evaluate and when to do it is not as clear. From the world of functional testing, we know about the test pyramid that spans unit testing to integration testing and everything between. But model evaluation does not conform nicely to this structure. Data Scientists build models intentionally holding data back for validation and testing. So, what is the right approach to LLM evaluation?

Screenshot of evaluation results showing 7 key metrics including Consistency, Relevance, Fluency, Demo.Actions.Length, and Demo.Summary.JSON

What are evaluations?

As developers working with LLM endpoints, we need to bridge our best practices for external API integration testing with learnings from Data Science where behavioral analysis during development is key. But before we jump in, let's agree on some common concepts.

  1. Behavior-based development. The key to effective use of LLMs is to embrace the lack of determinism and think in terms of behaviors.

  2. Transactional interaction. Most LLMs are called through standard API endpoints and payloads. This may seem counter-intuitive to call out. "Of course they are!" you say. Well, true. But it also allows us to use our well-understood API delivery habits. Remember, AI != LLM. AI can be accessed in a variety of fashions. For example, a classifier may augment records in a datastore with metadata, clustering labels, workflow steps, and more. But nearly all LLMs are called via API.

  3. Evaluation is more than testing. There is a growing tendency to use the term evaluation to mean test. This is far from correct. Because of this behavioral, transactional nature, LLM evaluation is critical to development, validation, testing, and production feedback.

Evaluation should be thought of as a collection of distinct assertions on the output of a model. To drive evaluation, you must have a representative range of inputs from which to determine the distribution of assertion results. Two evaluation runs of an identical model and system should not be assumed to produce identical results. If that is the requirement, non-stochastic approaches should be considered instead.

Parts of Evaluation

There are five phases of evaluation for LLM output. Each phase has a unique set of needs. Although all of it is called "evaluation," there are useful differences at each phase. Evaluation happens during:

  • Prompt Development

  • Build/Unit Validation

  • Integration/Deployment

  • Feedback (Production Analysis)

  • Model Fine-Tuning

At Okareo we have deconstructed evaluation into smaller units that we call Checks. This allows us to address the needs of each phase dynamically, drawing from a collection of checks that may overlap but are always relevant for each phase.

Scenarios

At this point, it is important to spend a moment on scenarios. Scenarios are intentional collections of inputs and outputs along a specific variable dimension, used to shine a light on the behavior of a model. If an LLM prompt seeks to generate code, then a scenario set should contain dozens, hundreds, or thousands of input/output pairs that can be used to explore (i.e., evaluate) the dependency constraints set by the prompt. It is reasonable to assume that for each set of behaviors expected from an LLM interaction, there should be a corresponding set of scenarios used to stretch and shine a light on that behavior.
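To make this concrete, here is a minimal illustration of what a scenario set might look like as plain data. The list-of-dicts shape and field names are illustrative assumptions, not a specific SDK schema.

```python
# A scenario set: input/output pairs along one behavioral dimension (here,
# code-generation requests that must respect a dependency constraint set by the prompt).
scenario_set = [
    {
        "input": "Write a Python function that parses a CSV file.",
        "expected": "Uses only the standard library (csv module), no pandas.",
    },
    {
        "input": "Write a Python function that fetches a URL and returns JSON.",
        "expected": "Uses only the standard library (urllib, json), no requests.",
    },
    # ...dozens to thousands more rows along the same dimension
]
```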

Checks

Fundamentally, checks are metrics. The output of a check is either binary (pass/fail) or numeric (1-n). Each check included in an evaluation is passed a scenario row and can use any combination of prompt, model input, model output, and expected result to determine the appropriate score for that row. Binary checks assert passing or failing. Numeric checks provide distributions. For example, a summary that must be less than 256 characters should be pass/fail. But a summary that should merely be "short" may just return the length so we can review the distribution of "short" answers. Maybe the prompt needs to be "hyper short" or "no more than 50 words".
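As a minimal sketch of those two shapes, the binary and numeric versions of a summary-length check might look like this; the function signatures are illustrative, not a specific SDK interface.

```python
def summary_under_256_chars(model_output: str) -> bool:
    """Binary check: pass/fail against a hard limit."""
    return len(model_output) <= 256


def summary_length(model_output: str) -> int:
    """Numeric check: just return the length so the distribution of
    'short' answers can be reviewed across a scenario set."""
    return len(model_output)
```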

Okareo Checks come in three flavors:

  • Native Okareo Checks

  • Custom Deterministic

  • Custom Judge/Jury

Okareo comes with access to a growing list of native checks. Checks in this category are typically quite generic and come largely from academic research. The list currently includes Levenshtein Distance, BLEU Score, Fluency, Consistency, and many more.

Custom deterministic checks are written in natural language, Python, or TypeScript (TS coming soon). Like all checks, these have access to the full range of context on the row they are evaluating. They are intended to be very fast and allow you to cycle through large numbers of scenarios quickly. Deterministic checks often make heavy use of language semantics and regular expressions to assert specific expectations on model output. For example, a check could pass or fail based on whether the output is in markdown or has source references.
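A sketch of a regex-based deterministic check along those lines might look like the following; the signature and check name are assumptions for illustration.

```python
import re


def has_markdown_and_sources(model_output: str) -> bool:
    """Pass only if the output contains at least one markdown heading or bullet
    AND at least one source reference such as a URL or [1]-style citation."""
    looks_like_markdown = bool(re.search(r"(^|\n)(#{1,6} |[-*] )", model_output))
    has_source_reference = bool(re.search(r"https?://\S+|\[\d+\]", model_output))
    return looks_like_markdown and has_source_reference
```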

Judge/jury checks are very powerful but need to be used carefully. Checks in this category use a model, often an LLM, to analyze the output of another model. This process can be fast, but it isn't hard to unintentionally create a very expensive (in time and money) judge. For example, the decomposition pattern, where each sentence of an output gets a unique jury judgement and the judge re-assembles the completed list into a final score, may mean hundreds of calls per scenario. There are good reasons to do this, but for now, it won't be fast. So, use the judge/jury pattern thoughtfully.
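A minimal sketch of that decomposition pattern, assuming a hypothetical `ask_jury` function that wraps a single LLM call, makes the call-count cost visible.

```python
def decomposed_score(model_output: str, claim: str, ask_jury) -> float:
    """Judge: split the output into sentences, ask the jury about each one,
    and re-assemble the verdicts into a single score."""
    sentences = [s.strip() for s in model_output.split(".") if s.strip()]
    verdicts = [ask_jury(sentence=s, claim=claim) for s in sentences]  # one LLM call each
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```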

Phases of Evaluation

Checks are a powerful tool for development, not just evaluation. The "just build it and figure out testing later" mindset misses the power of evaluation during development.

If I were building a user interface, I would make a set of changes and then look to see the specific change in the UI. If I were building a service, I would make changes and then run my unit tests to see the specific change in the result. So, why wouldn't I do the same with prompts? Enter manual validation with a playground. Most playgrounds enable me to poke a model with a prompt and see the result. But where my UI or API change is discrete and deterministic, my prompt change is not. So, as useful as manual validation is for rapid feedback, it is ultimately a poor and slow approach to multi-behavior evaluation. The solution is chunked-up evaluation where each chunk is phase appropriate. After all, I did not become more patient or more interested in QA just because I started using LLMs. But the opportunities to stub my toe, get barked at by the business, or slow my co-workers with regressions and unexpected behaviors have all increased if I don't adapt.

Prompt Development: Deterministic Distributions

During development, use deterministic checks that provide scores. This gives you an understanding of response distributions. At this point "failure" is not important - possibly not even defined yet. You want to see the possible behaviors your model is returning and find the peaks, valleys, and steps your prompt is driving. A minimal sketch of this loop follows the list below.

  • Establish 3-6 deterministic checks that you run during development

  • Each check should provide a metric that you can use to see the distribution of results

  • If pass/fail is appropriate, use it. But the goal of this stage is to understand results

  • Create enough metrics that you can use them to define the assertions for pass/fail regression
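Here is the promised sketch of that development loop: a couple of numeric checks run across a scenario set, then summarized as distributions rather than pass/fail. `run_model` and the scenario row format are assumptions for illustration.

```python
from statistics import mean, quantiles


def response_length(output: str) -> int:
    return len(output)


def bullet_count(output: str) -> int:
    return sum(1 for line in output.splitlines() if line.lstrip().startswith(("-", "*")))


checks = {"length": response_length, "bullets": bullet_count}


def summarize(scenario_set, run_model):
    """Run every numeric check over every scenario row and print distributions."""
    results = {name: [] for name in checks}
    for row in scenario_set:
        output = run_model(row["input"])          # call your model/prompt here
        for name, check in checks.items():
            results[name].append(check(output))
    for name, values in results.items():
        p25, p50, p75 = quantiles(values, n=4)
        print(f"{name}: mean={mean(values):.1f} p25={p25} p50={p50} p75={p75}")
```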

Build/Unit Validation: Deterministic Pass/Fail

Unit testing is the gatekeeper for check-ins and PRs. These should be deterministic checks that pass or fail. Although you may choose to maintain some number of distributional checks, the goal of this phase is rapid clarity on readiness. These are the checks you should run in CI whenever a prompt changes and prior to any merge. Pass/fail checks should be paired with distributional checks that are either run at the same time or can be run if the model fails. The Okareo SDKs allow you to make this determination lazily as part of the evaluation flow.
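A minimal sketch of that lazy ordering, using hypothetical placeholder functions rather than the Okareo SDK API: fast pass/fail checks gate the merge, and the slower distributional checks only run when something fails.

```python
def run_pass_fail_checks(scenario_set, run_model) -> list[str]:
    ...  # placeholder: return the names of failing checks
    return []


def run_distribution_checks(scenario_set, run_model) -> None:
    ...  # placeholder: report score distributions to help diagnose the failure


def ci_gate(scenario_set, run_model) -> int:
    failures = run_pass_fail_checks(scenario_set, run_model)   # fast, deterministic
    if failures:
        run_distribution_checks(scenario_set, run_model)       # slower, only when needed
        print(f"{len(failures)} pass/fail check(s) failed; see distributions above.")
        return 1  # non-zero exit blocks the merge in CI
    return 0
```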

Integration/Deployment: Judge/Jury

Now comes the domain of the judge and jury. Before deployment, during integration validation, you should take the few additional minutes to validate that the system behaves as expected. LLMs are really good at this. The same non-determinism that makes it hard to pin down precise results is able to interpret an interaction and provide visibility into expected vs. actual results. Like checks oriented toward pass/fail and distributions, judge/jury checks should use a combination of both approaches. However, in this case, don't divide them: use both at the same time. Since the process is relatively slow and expensive by definition, don't require it to be run twice. Do the full analysis and then determine readiness for deployment. Okareo has a variety of reporters for use in CI to set thresholds and determine readiness to deploy. Using checks at this stage allows you to lean into Continuous Deployment for LLM prompts.

Aren't LLMs poor at scoring? True, for scoring but not for assessment. There are several techniques for addressing this difference. Judges can be limited to subsets of the total content and provide a pass/fail judgement for each - for example, assess each sentence. The aggregate of the assessments can become the score. Alternatively, a clear rubric for scoring gives the stochastic system a way to categorize and find the best fit to a score category. Simply saying "... and score from 1-100" is not going to be sufficient. Be specific about scores. Consider using a smaller scale like 1-5, but be specific about each position in the score domain.
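For example, a specific 1-5 rubric embedded in a judge prompt might look like the sketch below; the wording and score definitions are illustrative assumptions.

```python
# A judge prompt template with an explicit definition for each score position.
SUMMARY_RUBRIC = """\
Score the summary below on a 1-5 scale. Use exactly these definitions:
1 - Contradicts the source or invents facts.
2 - No invented facts, but misses the main point of the source.
3 - Captures the main point but omits key supporting details.
4 - Captures the main point and key details with minor wording issues.
5 - Accurate, complete, and concise; nothing important missing or added.
Respond with only the integer score.

Source:
{source}

Summary:
{summary}
"""

prompt = SUMMARY_RUBRIC.format(source="...source text...", summary="...model output...")
```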

Feedback (Production Analysis): Deterministic + Scenarios

The production phase is the one that varies most from existing patterns. Although techniques like "canary" or "A/B" validation are fine for deployment stability and health, they have little to do with the proper functioning of the prompt or model. It is in production that all the unusual interactions start. Existing tools for observability and error reporting will indicate systemic stability, but behavior is different. The simplified thumbs-up, thumbs-down concept is useful but has problems; we will dedicate an entire blog to that in the future. The key to this phase is collecting feedback from users and from existing checks to flag behaviors that are well outside the expected norms. This phase also requires reputational and jailbreaking checks to ensure that the model behaves within guidelines - not just operates. When interactions fall outside of expected norms based on a subset of the checks from above, it is useful to gather the input/output pairs and generate a synthetic scenario set that you can use to improve the prompt, data pipeline, or model itself. This is one of the many entry points into fine-tuning. We strongly suggest that interactions from this stage lead to additional guardrail checks and expansions or modifications of your regression scenarios and checks. After all, the best regression suites are based on real-world interaction.
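A minimal sketch of turning flagged production interactions into a synthetic regression scenario set; the record fields and the out-of-norm heuristic are assumptions for illustration.

```python
def is_out_of_norm(record) -> bool:
    """Flag records with explicit negative feedback or a low check score."""
    return bool(record.get("thumbs_down")) or record.get("check_score", 1.0) < 0.5


def build_regression_scenarios(production_records):
    """Collect flagged input/output pairs into new scenario rows for regression."""
    return [
        {"input": r["input"], "expected": r.get("corrected_output", r["output"])}
        for r in production_records
        if is_out_of_norm(r)
    ]
```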

Model Fine-Tuning: Expanded Scenarios

This stage is listed last but really should be happening throughout the development lifecycle. Here the evaluations should focus on the gap between the original model and the fine-tuned model. Were enough fine-tuning examples provided? The deterministic and judgment checks should be used as part of the fine-tuning evaluation process to determine success. When complete, as with Feedback, elements from the new scenarios and checks should be incorporated into the Build and Integration phases to protect against future drift and regressions.
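A minimal sketch of measuring that gap: run the same check suite against the base and fine-tuned models and compare pass rates. The model callables and the check map are assumptions.

```python
def pass_rate(run_model, scenario_set, checks) -> dict[str, float]:
    """Fraction of scenario rows passing each check for one model."""
    rates = {}
    for name, check in checks.items():
        passed = sum(check(run_model(row["input"])) for row in scenario_set)
        rates[name] = passed / len(scenario_set)
    return rates


def compare(base_model, tuned_model, scenario_set, checks):
    """Print the per-check delta between the original and fine-tuned model."""
    base, tuned = (pass_rate(m, scenario_set, checks) for m in (base_model, tuned_model))
    for name in checks:
        print(f"{name}: base={base[name]:.0%} tuned={tuned[name]:.0%} delta={tuned[name] - base[name]:+.0%}")
```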

All Together Now

Okareo has a rich set of capabilities to help you evaluate complex model output. In this post, we looked at checks, the element of evaluation that drives metrics. Checks should be used at each phase of development. The most efficient teams that we see use these methods throughout their development lifecycle. They don't leave it to the end as an afterthought, as has been the historical norm with functional quality.

Assembling checks into useful evaluations that are time and cost efficient requires some thought. By thinking in terms of LLM development phases, we are able to limit time spent and maximize the signals of readiness.
