Comparing Models: Claude 3.5 Sonnet vs. GPT-3.5

RAG

Hrithik Datta
Founding Software Engineer
August 20, 2024
With new models and model versions appearing every week, it helps to have a go-to process for comparing them. Published benchmarks aside, you don't really know whether a model will work for you until you compare it against your existing baselines. The solution is a ready-made set of scenarios and evaluations that you can use to baseline and compare new models on your own use cases. In this example, we will compare Claude 3.5 Sonnet and GPT-3.5. You can use the same approach to compare any set of models.
Why Baseline With Your Use Case?
Every product and industry has unique language model needs. While general benchmarks exist, they do not reflect the specific demands of your use case. By creating synthetic scenarios and evaluation checks tailored to those needs, you can:
Establish baselines specific to your task
Quickly compare multiple models against these baselines
Make data-driven decisions about which model best suits your needs

Setting Up Your Comparison
Step 1: Register the Models
First, you'll need to register the models you want to compare in Okareo. This involves:
Defining how to interact with each model (API calls, etc.)
Setting up any necessary prompts or instructions
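In Okareo this registration happens through the SDK; as a provider-agnostic, runnable sketch of the underlying idea (the class names below are hypothetical stand-ins, not the Okareo or provider APIs), the key move is to put every model behind one common interface so the rest of the evaluation code stays model-agnostic:

```python
from typing import Protocol

class ModelUnderTest(Protocol):
    """Common interface every registered model must expose."""
    name: str
    def generate(self, prompt: str) -> str: ...

class ClaudeStub:
    """Hypothetical placeholder for a real Anthropic client call."""
    name = "claude-3-5-sonnet"
    def generate(self, prompt: str) -> str:
        return f"[claude reply to: {prompt}]"

class GPTStub:
    """Hypothetical placeholder for a real OpenAI client call."""
    name = "gpt-3.5-turbo"
    def generate(self, prompt: str) -> str:
        return f"[gpt reply to: {prompt}]"

# Both models now look identical to the evaluation harness.
models: list[ModelUnderTest] = [ClaudeStub(), GPTStub()]
```

In practice, each stub's `generate` would call the corresponding provider SDK with your system prompt and instructions; everything downstream only sees the shared interface.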
Step 2: Create a Scenario to Evaluate
Scenarios are crucial for meaningful comparisons. They allow you to:
Define realistic inputs that represent your actual use case
Provide baseline responses
Ensure consistency across model evaluations
When creating a scenario, consider:
The variety of inputs you need to test
Edge cases or challenging examples
A representative sample of your expected workload
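A minimal sketch of what such a scenario set might look like as plain data (the rows here are invented examples, not real Okareo seed data; replace them with inputs from your own workload):

```python
# A scenario set: realistic inputs paired with baseline ("gold") results.
scenario_set = [
    {"input": "How do I reset my password?",
     "result": "Go to Settings > Security and click 'Reset password'."},
    {"input": "Can I export my data as CSV?",
     "result": "Yes. Open the report and choose Export > CSV."},
    # Edge case: an ambiguous request where the right move is to clarify.
    {"input": "It doesn't work.",
     "result": "Ask which feature is failing and for any error message."},
]
```

Keeping the scenario as structured input/result pairs is what makes the comparison repeatable: every model is evaluated against exactly the same inputs and baselines.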
Step 3: Create a Model-Based Check
Checks allow you to quantify the performance of each model. A model-based check can:
Evaluate generated responses against specific criteria
Provide consistent scoring across different models
Focus on aspects most important to your use case
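Okareo's model-based checks use a judge model to score responses against your criteria. As a runnable, deterministic stand-in that shows the same shape (a function mapping a response and baseline to a score), here is a crude keyword-coverage check; it is illustrative only and much weaker than an LLM judge:

```python
import string

def coverage_check(response: str, baseline: str) -> float:
    """Score in [0, 1]: fraction of the baseline's content words
    that appear in the generated response. A deterministic stand-in
    for a model-based (LLM judge) check."""
    def words(text: str) -> set[str]:
        cleaned = text.lower().translate(
            str.maketrans("", "", string.punctuation))
        # Skip short, stopword-ish tokens.
        return {w for w in cleaned.split() if len(w) > 3}
    expected = words(baseline)
    if not expected:
        return 1.0
    return len(expected & words(response)) / len(expected)
```

Whatever the scoring mechanism, the important property is that the check applies the same criteria to every model, so scores are comparable across runs.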
Step 4: Run the Evaluations
Once the evaluations have run against your scenario set, you'll want to:
Apply your check to the generated responses
Review the results to understand each model's performance
Compare the models based on the criteria most important to your use case
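Conceptually, the evaluation run is a loop over the scenario set that collects each model's response and applies the check (Okareo handles this for you; the sketch below uses stubbed models and an exact-match check purely for illustration):

```python
def run_evaluation(models, scenario_set, check):
    """Apply `check` to every model's response for each scenario row."""
    results = {}
    for name, generate in models.items():
        scores = []
        for row in scenario_set:
            response = generate(row["input"])
            scores.append(check(response, row["result"]))
        results[name] = scores
    return results

# Stubbed setup: one model echoes the baseline, the other deflects.
scenario_set = [
    {"input": "How do I reset my password?",
     "result": "Go to Settings and click reset password."},
]
answers = {row["input"]: row["result"] for row in scenario_set}
models = {
    "model-a": lambda q: answers.get(q, ""),
    "model-b": lambda q: "Sorry, I can't help with that.",
}
check = lambda resp, base: float(resp.strip() == base.strip())

results = run_evaluation(models, scenario_set, check)
```

The output is a score per model per scenario row, which is exactly the shape you need for side-by-side comparison in the next step.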
Step 5: Interpret the Results
When reviewing your comparison results:
Look for patterns in where each model excels or struggles
Consider how the performance aligns with your specific needs
Determine if the differences are significant enough to impact your decision
Think about any trade-offs between performance and other factors (cost, latency, etc.)
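As a small sketch of this analysis step (the scores below are made up for illustration, not real benchmark results), aggregate per-model means tell you who wins overall, while per-row differences tell you where:

```python
from statistics import mean

def summarize(results: dict[str, list[float]]) -> dict[str, float]:
    """Mean check score per model across all scenario rows."""
    return {name: round(mean(scores), 3)
            for name, scores in results.items()}

# Illustrative numbers only.
results = {
    "claude-3-5-sonnet": [1.0, 0.8, 0.9],
    "gpt-3.5-turbo":     [0.9, 0.6, 0.7],
}
summary = summarize(results)

# Per-row differences reveal *where* one model wins, not just by how much.
row_diffs = [a - b for a, b in zip(results["claude-3-5-sonnet"],
                                   results["gpt-3.5-turbo"])]
```

A model that wins on average but loses badly on your edge-case rows may still be the wrong choice, which is why the per-row view matters.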
Remember, the goal is not just to find the "best" model, but the one that best fits your unique requirements and constraints.

By using Okareo to create custom scenarios and checks, you can make informed decisions about which language model is best suited for your specific tasks and industry needs.



