The best tools we have for handling reasoning tasks on large pieces of text are LLMs. LLM evaluators can perform complex and nuanced tasks that require human-like reasoning.
But why would LLM evaluation work if my own inference failed?
TLDR: Classification (pass/fail) is usually a much easier task than your original generation. For example, a grading prompt can be as simple as:
Does the provided context {context} contain the answer to this question: {query}
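As a rough illustration of how a prompt like this turns into a pass/fail check, here is a minimal sketch. The helper names (`build_grading_prompt`, `parse_verdict`) and the extra "answer yes or no" instruction are assumptions made for this example, not part of any particular library; the call to the LLM itself is left out.

```python
from typing import Optional

# Grading template from the text, plus an assumed instruction that
# constrains the reply to a single word so it is easy to parse.
GRADING_PROMPT = (
    "Does the provided context {context} contain the answer to this question: {query}\n"
    "Answer with a single word: yes or no."
)


def build_grading_prompt(context: str, query: str) -> str:
    """Fill the template with the retrieved context and the user query."""
    return GRADING_PROMPT.format(context=context, query=query)


def parse_verdict(response: str) -> Optional[bool]:
    """Map the grader's reply to True (pass), False (fail), or None (unparseable)."""
    words = response.strip().lower().split()
    first = words[0].strip(".,!:;\"'") if words else ""
    if first == "yes":
        return True
    if first == "no":
        return False
    return None


prompt = build_grading_prompt(
    context="Paris is the capital of France.",
    query="What is the capital of France?",
)
# `prompt` would be sent to whichever LLM acts as the grader;
# its reply is then passed to parse_verdict.
```

Returning `None` for anything other than a clear yes or no keeps malformed replies from being silently counted as a pass or a fail.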
Since the eval is performing a much simpler task, it can be expected to work consistently most of the time. We can also run the same grading prompt multiple times to detect flakiness and discard flaky results.
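One way to detect and discard flaky results is sketched below, under some assumptions: `grade_once` is a hypothetical callable that runs the grading prompt once and returns a pass/fail boolean, and the `runs` and `min_agreement` parameters are illustrative defaults rather than recommended values.

```python
from collections import Counter
from typing import Callable, Optional


def grade_with_retries(
    grade_once: Callable[[], bool],
    runs: int = 3,
    min_agreement: float = 1.0,
) -> Optional[bool]:
    """Run the same grader several times.

    Returns the majority verdict if at least `min_agreement` of the runs
    agree on it, otherwise None so the caller can discard the result as flaky.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    verdicts = [grade_once() for _ in range(runs)]
    verdict, count = Counter(verdicts).most_common(1)[0]
    if count / runs >= min_agreement:
        return verdict
    return None


# Stub graders standing in for real LLM calls (assumption for the example).
stable_result = grade_with_retries(lambda: True)

replies = iter([True, False, True])
flaky_result = grade_with_retries(lambda: next(replies))
# stable_result is True; flaky_result is None because the runs disagreed.
```

With the default `min_agreement=1.0`, any disagreement between runs marks the result as flaky. Lowering the threshold trades strictness for fewer discarded results.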