The best tools we have for handling reasoning tasks on large pieces of text are LLMs.

LLM evaluators can perform complex and nuanced tasks that require human-like reasoning.

But why would LLM evaluation work if my own inference failed?

TLDR: Classification (pass/fail) is usually a much easier task than your generation task.

The evaluation task is very different from the task you are asking your LLM to perform.

Your application’s inference task might be quite complex. It likely involves many conditions, rules, and pieces of data needed to produce a good answer, and it may need to generate a long, fairly complex response.

By contrast, the LLM evaluation task is simple: the evaluator is asked a much narrower question, one that is usually easy for a capable model to answer.

For example, the evaluator might be asked something like: “Does the provided context {context} contain the answer to this question: {query}?”
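
Here is a minimal sketch of what such a grading prompt might look like in code. The `call_llm` parameter is a hypothetical stand-in for whatever client library you actually use; the prompt wording and PASS/FAIL convention are just one way to phrase the check.

```python
# Sketch of a context-relevance judge. `call_llm(prompt) -> str` is a
# placeholder for your own LLM client call; swap in your provider's API.

GRADING_PROMPT = """You are grading a retrieval step.
Does the provided context contain the answer to the question?

Context:
{context}

Question:
{query}

Answer with exactly one word: PASS or FAIL."""


def judge_context_relevance(context: str, query: str, call_llm) -> bool:
    """Return True if the judge says the context answers the query."""
    prompt = GRADING_PROMPT.format(context=context, query=query)
    verdict = call_llm(prompt).strip().upper()
    return verdict.startswith("PASS")
```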

Since the eval is performing a much simpler task, it can be expected to work consistently most of the time. We can also run the same grading prompt multiple times to detect flakiness and discard flaky results.
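
One way to put that into practice, assuming the hypothetical `judge_context_relevance` helper above and an agreement threshold you choose yourself:

```python
from collections import Counter


def stable_verdict(context: str, query: str, call_llm,
                   runs: int = 5, agreement: float = 0.8):
    """Run the same grading prompt several times; return the majority
    verdict only if the judge agrees with itself often enough,
    otherwise None (treat the result as flaky and discard it)."""
    verdicts = [judge_context_relevance(context, query, call_llm)
                for _ in range(runs)]
    majority, count = Counter(verdicts).most_common(1)[0]
    return majority if count / runs >= agreement else None
```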