There’s a very common problem teams face when changing a prompt or model:

Is the new prompt / model actually better or worse?

One solution that many teams seem to have landed on is to compare the old response and the new response side by side.

Often, they have a human annotator, domain expert, or team member score which one is better.

In fact, even teams like OpenAI follow a similar approach as part of their evaluation process.

But the challenge is that human review isn’t scalable or cost-effective to do regularly.

It takes many hours of valuable time and can still be subjective.

Well, it turns out that LLMs can do these comparisons with precision comparable to that of human reviewers, as long as you define the grading criteria very clearly.
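
As a rough illustration of the idea (not Athina’s implementation), here is a minimal sketch of a pairwise LLM judge in Python. It assumes the OpenAI Python client, and the judge model name and the grading criteria are placeholders; the important part is that the criteria and the allowed outputs (1 or 2) are spelled out explicitly in the grading prompt.

```python
# Minimal sketch of a pairwise LLM judge, assuming the OpenAI Python client.
# The judge model and the grading criteria below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GRADING_PROMPT = """You are comparing two responses to the same user query.

Scoring criteria, in order of importance:
1. Factual accuracy with respect to the query.
2. Completeness: does the response address every part of the query?
3. Conciseness: shorter is better when accuracy and completeness are equal.

Query:
{query}

Response 1:
{response_1}

Response 2:
{response_2}

Answer with a single digit: 1 if Response 1 is better, 2 if Response 2 is better."""


def judge_pair(query: str, response_1: str, response_2: str) -> int:
    """Return 1 or 2, whichever response the judge model prefers."""
    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        temperature=0,   # keep the grading deterministic
        messages=[{
            "role": "user",
            "content": GRADING_PROMPT.format(
                query=query, response_1=response_1, response_2=response_2
            ),
        }],
    )
    verdict = completion.choices[0].message.content.strip()
    return 1 if verdict.startswith("1") else 2
```

A common refinement is to judge each pair twice with the response order swapped, since LLM judges can show a position bias toward whichever response appears first.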

Pairwise Evaluation Workflow on Athina

Setup time: 5 mins

  • Generate 2 sets of responses using dynamic columns.
  • Create a Custom Prompt evaluation with a Numeric output type (1 or 2).
    • Make sure you clearly define the scoring criteria.
  • View results side by side along with evaluation metrics.
  • Spot check as required (see the aggregation sketch after this list).
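
To make the last two steps concrete, here is a small, self-contained sketch (outside of Athina) of turning the judge’s 1-or-2 verdicts into a win rate and picking a few rows for manual review. The verdict values below are placeholders.

```python
# Sketch of aggregating pairwise verdicts and choosing rows to spot check.
# `verdicts` would come from a judge like the one sketched earlier; the values
# here are placeholders.
import random

verdicts = [2, 2, 1, 2, 1, 2, 2, 1, 2, 2]  # 1 = old response wins, 2 = new response wins

new_wins = sum(1 for v in verdicts if v == 2)
win_rate = new_wins / len(verdicts)
print(f"New prompt/model wins {new_wins}/{len(verdicts)} comparisons ({win_rate:.0%})")

# Spot check: sample a few of the comparisons the old response won, since those
# are usually the most informative ones to read by hand.
old_win_rows = [i for i, v in enumerate(verdicts) if v == 1]
sample = random.sample(old_win_rows, k=min(3, len(old_win_rows)))
print("Rows to review manually:", sample)
```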

Then you can change the prompt / model configurations in the dynamic columns and re-run the evaluations to iterate rapidly.