Comparing Models: Claude Sonnet 3.5 vs. GPT-3.5

RAG

Hrithik Datta, Founding Software Engineer

August 20, 2024

With new models and model versions appearing every week, it helps to have a go-to process for comparing them. Published benchmarks only tell you so much: you don't really know whether a model will work for you until you compare it against your existing baselines. The solution is a ready-made set of scenarios and evaluations that you can reuse to baseline and compare new models on your own use cases. In this example, we will compare Claude Sonnet 3.5 and GPT-3.5, but you can use the same approach to compare any set of models.

Why Baseline With Your Use Case?

Every product and industry has unique language model needs. General benchmarks exist, but they do not reflect the specific requirements of your use case. By creating synthetic scenarios and evaluation checks tailored to those requirements, you can:

  1. Establish baselines specific to your task

  2. Quickly compare multiple models against these baselines

  3. Make data-driven decisions about which model best suits your needs

Setting Up Your Comparison

Step 1: Register the Models

First, you'll need to register the models you want to compare in Okareo. This involves:

  • Defining how to interact with each model (API calls, etc.)

  • Setting up any necessary prompts or instructions

# Code for registering models (Claude and GPT-3.5) in Okareo 
gpt_model = okareo.register_model(
  name="GPT-3.5",
  model=OpenAIModel(...
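
To make this step concrete, here is a fuller sketch of registering both models. It assumes the Okareo Python SDK and the Anthropic SDK are installed and that your API keys are set as environment variables; the model IDs, the shared system prompt, the {scenario_input} template variable, and the CustomModel/ModelInvocation wrapper used for Claude are illustrative and may differ from the exact signatures in your SDK version, so treat this as a starting point rather than a drop-in implementation.

# A fuller sketch of registering both models in Okareo (illustrative values)
import os

import anthropic
from okareo import Okareo
from okareo.model_under_test import CustomModel, ModelInvocation, OpenAIModel

okareo = Okareo(os.environ["OKAREO_API_KEY"])

# Use the same system prompt for both models so the comparison is apples to apples
SYSTEM_PROMPT = "You are a helpful customer support agent for a SaaS platform."

# GPT-3.5 registers directly through Okareo's OpenAI integration
gpt_model = okareo.register_model(
  name="GPT-3.5",
  model=OpenAIModel(
    model_id="gpt-3.5-turbo",
    temperature=0,
    system_prompt_template=SYSTEM_PROMPT,
    user_prompt_template="{scenario_input}",  # assumed placeholder for the scenario input
  ),
)

# One way to register Claude Sonnet 3.5 is to wrap the Anthropic SDK in a CustomModel
class ClaudeSonnetModel(CustomModel):
  def invoke(self, input_value: str) -> ModelInvocation:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    message = client.messages.create(
      model="claude-3-5-sonnet-20240620",
      max_tokens=512,
      system=SYSTEM_PROMPT,
      messages=[{"role": "user", "content": input_value}],
    )
    return ModelInvocation(
      model_prediction=message.content[0].text,
      model_input=input_value,
    )

claude_model = okareo.register_model(
  name="Claude Sonnet 3.5",
  model=ClaudeSonnetModel(name="Claude Sonnet 3.5"),
)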

Step 2: Create a Scenario to Evaluate

Scenarios are crucial for meaningful comparisons. They allow you to:

  • Define realistic inputs that represent your actual use case

  • Provide baseline responses

  • Ensure consistency across model evaluations

When creating a scenario, consider:

  • The variety of inputs you need to test

  • Edge cases or challenging examples

  • A representative sample of your expected workload

# Code for creating a scenario in Okareo 
scenario_set_create = ScenarioSetCreate(
  name="Your Scenario Name",
  seed_data=[
    SeedData(
      input_="Can I connect my CRM to your platform?",
      result="Certainly ..."         
    ),
    ...
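
Continuing from the registration sketch above, a fuller version of this snippet defines a handful of seed rows and uploads them with create_scenario_set so the same scenario can be reused for every model. The seed questions and answers below are invented examples, and the import path for ScenarioSetCreate and SeedData is an assumption based on the SDK at the time of writing.

# A fuller sketch of building and uploading a small scenario set (invented seed rows)
from okareo_api_client.models import ScenarioSetCreate, SeedData

scenario_set_create = ScenarioSetCreate(
  name="Customer Support Inquiries",
  seed_data=[
    SeedData(
      input_="Can I connect my CRM to your platform?",
      result="Certainly! We offer native integrations with most major CRMs...",
    ),
    SeedData(
      input_="How do I reset my password?",
      result="You can reset your password from the login page by selecting...",
    ),
    SeedData(  # an edge case: a capability the product does not offer
      input_="Do you support on-premise deployment?",
      result="We do not currently offer on-premise deployment, but...",
    ),
  ],
)

# Upload the scenario once; it is reused for every model you evaluate
scenario = okareo.create_scenario_set(scenario_set_create)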

Step 3: Create a Model-Based Check

Checks allow you to quantify the performance of each model. A model-based check can:

  • Evaluate generated responses against specific criteria

  • Provide consistent scoring across different models

  • Focus on aspects most important to your use case

# Code for creating a model-based check in Okareo 
check_response_quality = okareo.create_or_update_check(
  name=f"check_response_quality",
  description="Score the quality of the generated customer support response",
  check=ModelBasedCheck(
    prompt_template="""
      Evaluate the quality of the generated customer support response based on the following criteria:
        1. Relevance to the inquiry
        2. Clarity and coherence... 
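
A completed version of this check might look like the sketch below, which continues with the okareo client from Step 1. The {input} and {generation} template variables, the CheckOutputType import, and the 1-5 scoring scale are assumptions made to illustrate the shape of a model-based check; confirm the supported placeholders and check types in the Okareo docs.

# A fuller sketch of a model-based check that returns a numeric score (illustrative)
from okareo.checks import CheckOutputType, ModelBasedCheck

check_response_quality = okareo.create_or_update_check(
  name="check_response_quality",
  description="Score the quality of the generated customer support response",
  check=ModelBasedCheck(
    prompt_template="""
      Evaluate the quality of the generated customer support response based on the following criteria:
        1. Relevance to the inquiry
        2. Clarity and coherence
        3. Completeness of the answer
        4. Professional, helpful tone

      Customer inquiry: {input}
      Generated response: {generation}

      Return a single integer score from 1 (poor) to 5 (excellent).
    """,
    check_type=CheckOutputType.SCORE,  # assumed enum for score-style checks
  ),
)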

Step 4: Run the Evaluations

Run an evaluation of each registered model against your scenario (a minimal sketch follows this list). After the runs complete, you'll want to:

  • Apply your check to the generated responses

  • Review the results to understand each model's performance

  • Compare the models based on the criteria most important to your use case
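
Here is a minimal sketch of what the runs themselves can look like, continuing with the models, scenario, and check from the sketches above. The run_test parameters, the TestRunType import path, and the app_link attribute reflect the Okareo SDK at the time of writing and may differ in your version.

# A minimal sketch of evaluating both registered models on the same scenario and check
import os

from okareo_api_client.models.test_run_type import TestRunType

# gpt_model, claude_model, and scenario come from the earlier steps
gpt_run = gpt_model.run_test(
  scenario=scenario,
  name="Support responses - GPT-3.5",
  test_run_type=TestRunType.NL_GENERATION,
  api_key=os.environ["OPENAI_API_KEY"],  # forwarded to the OpenAI-hosted model
  checks=["check_response_quality"],
)

claude_run = claude_model.run_test(
  scenario=scenario,
  name="Support responses - Claude Sonnet 3.5",
  test_run_type=TestRunType.NL_GENERATION,
  checks=["check_response_quality"],  # the CustomModel wrapper supplies its own API key
)

# Each run links to a side-by-side view of responses and check scores in the Okareo app
print(gpt_run.app_link)
print(claude_run.app_link)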

Step 5: Interpret the Results

When reviewing your comparison results:

  1. Look for patterns in where each model excels or struggles

  2. Consider how the performance aligns with your specific needs

  3. Determine if the differences are significant enough to impact your decision

  4. Think about any trade-offs between performance and other factors (cost, latency, etc.)

Remember, the goal is not just to find the "best" model, but the one that best fits your unique requirements and constraints.

By using Okareo to create custom scenarios and checks, you can make informed decisions about which language model is best suited for your specific tasks and industry needs.
