A step-by-step guide to pairwise evaluation for comparing model outputs using Athina AI.
A key challenge in developing or improving prompts and models is determining whether a new configuration performs better than an existing one. Pairwise evaluation addresses this by comparing two responses side by side based on specific criteria such as relevance, accuracy, or fluency.
Traditionally conducted by human reviewers, this process can be time-consuming, costly, and subjective. Tools like Athina AI automate pairwise evaluation with LLMs, making it faster, more scalable, and more consistent. This guide explains what pairwise evaluation is, where it can be used, and how to perform it using Athina AI.
Let’s start by understanding what pairwise evaluation is.
Pairwise evaluation is a method for comparing two outputs from different prompts or models to determine which performs better. This comparison is based on criteria such as relevance, accuracy, or fluency. For example, you can compare responses from an old and a new model to identify improvements.
This method is widely used by AI teams as part of their evaluation processes. Although the comparison is traditionally done by human reviewers, it can also be automated with LLMs, provided the grading criteria are well-defined. Automated tools like Athina let you evaluate more efficiently, at a larger scale, and with less subjectivity.
Pairwise evaluation is highly versatile and can be applied in many scenarios, such as comparing two models, two versions of a prompt, or a new configuration against an existing baseline. By applying pairwise evaluation in these areas, teams can make informed decisions and ensure continuous improvement.
In this guide, we will perform a pairwise evaluation on the Ragas WikiQA dataset, which contains questions, context, and ground truth answers. This dataset is generated using information from Wikipedia pages.
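If you want to inspect the dataset locally before working with it in Athina, a minimal sketch like the following can help. It assumes the dataset is published on the Hugging Face Hub under the explodinggradients/ragas-wikiqa identifier; verify the exact name, split, and column layout for the copy you use.

```python
# Minimal sketch: peek at the Ragas WikiQA data before importing it into Athina.
# The dataset identifier, split name, and column layout are assumptions here --
# confirm them against the copy you are actually using.
from datasets import load_dataset

dataset = load_dataset("explodinggradients/ragas-wikiqa", split="train")

print(dataset.column_names)  # expect question / context / ground-truth answer columns
print(dataset[0])            # inspect one record before running any prompts
```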
Now let’s walk through the step-by-step process of running a pairwise evaluation in Athina AI:
Start by generating two sets of responses with two different models, as shown in the following images.
Run Prompt to generate responses:
Output from both models will look something like this:
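Athina’s prompt runs produce these two response columns for you, but if you prefer to prepare them outside the UI, the idea looks roughly like the sketch below. The OpenAI client and the model names are only illustrative assumptions; substitute whichever models you are comparing.

```python
# Illustrative sketch only: generate one response per question from two models
# so the outputs can later be compared pairwise. Model names are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate(model: str, question: str, context: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer the question using the given context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

row = {
    "question": "Who wrote Hamlet?",
    "context": "Hamlet is a tragedy written by William Shakespeare.",
}
model_1_response = generate("gpt-3.5-turbo", row["question"], row["context"])
model_2_response = generate("gpt-4o", row["question"], row["context"])
print(model_1_response, model_2_response, sep="\n---\n")
```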
Next, click on the Evaluate feature, then select Create New Evaluation and choose the Custom Eval option.
Then click on the Custom Prompt option, as shown below:
Now define your pairwise evaluation prompt. For example, instruct the evaluator to return 1 if the Model 1 response is better and 2 if the Model 2 response is better. Here is a sample pairwise evaluation prompt:
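The template below is only an illustration; the double-brace placeholders are assumptions and should be replaced with the actual column names in your Athina dataset.

```text
You are an impartial judge comparing two responses to the same question.

Question: {{query}}
Context: {{context}}
Expected answer: {{expected_response}}

Response 1 (Model 1): {{model_1_response}}
Response 2 (Model 2): {{model_2_response}}

Compare the two responses for relevance, accuracy, and fluency against the
expected answer. Return only the number 1 if Response 1 is better, or the
number 2 if Response 2 is better.
```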
Then, run the evaluation to compare each pair of responses based on the defined criteria.
Once the evaluation is complete, go to the SQL Section to view and compare the scores. This analysis will help you determine which model performed better across the dataset.
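As a rough sketch of the kind of aggregation you would run there, here is how you might tally the verdicts once the evaluation results are available locally, for example as a CSV file. The file name and the eval_result column are illustrative assumptions; adapt them to however you access your results.

```python
# Rough sketch: tally pairwise verdicts from evaluation results saved locally.
# "results.csv" and the "eval_result" column name are illustrative assumptions;
# use whatever your queried or exported results actually contain.
import csv
from collections import Counter

wins = Counter()
with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        wins[row["eval_result"]] += 1  # "1" = Model 1 better, "2" = Model 2 better

total = sum(wins.values())
for verdict, count in sorted(wins.items()):
    print(f"Model {verdict}: {count}/{total} wins ({count / total:.0%})")
```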
Use these results to refine your prompts, adjust evaluation criteria, or select the best-performing model.