Is the new prompt / model actually better or worse?

One solution many teams have landed on is to compare the old response and the new response side by side, typically having a human annotator, domain expert, or team member score which one is better. Even teams like OpenAI include this kind of side-by-side review in their evaluation process. The challenge is that human review isn't scalable or cost-effective to run regularly: it takes many hours of valuable time and can still be subjective. It turns out that LLMs can perform such comparisons with precision comparable to human reviewers, as long as you define the grading criteria very clearly.
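Here is a minimal sketch of what such an LLM-based pairwise comparison might look like, assuming the official OpenAI Python client and an API key in the environment. The judge model name, grading criteria, and prompt wording are illustrative, not a definitive implementation:

```python
# A minimal sketch of LLM-as-a-judge pairwise comparison.
# Assumes the OpenAI Python client (`pip install openai`) and an
# OPENAI_API_KEY in the environment; model and criteria are illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are comparing two answers to the same user question.

Grading criteria (in priority order):
1. Factual accuracy
2. Completeness with respect to the question
3. Clarity and conciseness

Question:
{question}

Answer A (old prompt/model):
{answer_a}

Answer B (new prompt/model):
{answer_b}

Respond with exactly one word: "A", "B", or "Tie"."""


def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge model which answer better satisfies the criteria."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; use whichever judge model you trust
        temperature=0,   # deterministic grading
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, answer_a=answer_a, answer_b=answer_b
            ),
        }],
    )
    return response.choices[0].message.content.strip()


if __name__ == "__main__":
    verdict = judge_pair(
        question="What does HTTP status code 429 mean?",
        answer_a="It means the requested resource was not found.",
        answer_b="It means the client sent too many requests (rate limiting).",
    )
    print(f"Judge verdict: {verdict}")
```

In practice you would run this over a whole test set and aggregate the verdicts; randomizing which response is labeled A versus B also helps counter the position bias LLM judges are known to exhibit.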