Frequently Asked Questions
The best tool we have for handling reasoning tasks over large pieces of text is an LLM.
LLM evaluators can perform complex and nuanced tasks that require human-like reasoning.
But why would LLM evaluation work if my own inference failed?
TL;DR: Classification (pass/fail) is usually a much easier task than your generation task.
The evaluation task is very different from the task you are asking your LLM to perform.
Your application’s inference task might be quite complex: it likely involves many conditions, rules, and pieces of data needed to produce a good answer, and it may be generating a long, fairly complex response.
In contrast, the LLM evaluation task is much simpler. The evaluator is asked to answer a much narrower question, which is usually easy for a powerful LLM.
For example, the evaluator might only be asked: “Does the provided context {context} contain the answer to this question: {query}?”
Since the eval is performing a much simpler task, it can be expected to work consistently most of the time. We can also run the same grading prompt multiple times to detect flakiness and discard flaky results.
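As a rough illustration (not Athina’s implementation), here is a minimal sketch of that idea, assuming the `openai` Python client and an `OPENAI_API_KEY` are available; the prompt wording and model name are illustrative, and the verdict is kept only if every run agrees:

```python
from collections import Counter
from openai import OpenAI  # assumes the `openai` package and an OPENAI_API_KEY are set up

client = OpenAI()

def grade(context: str, query: str, n_runs: int = 3):
    """Run a simple pass/fail grading prompt several times and keep the verdict
    only if every run agrees; otherwise treat the result as flaky and discard it."""
    prompt = (
        f"Does the provided context contain the answer to this question?\n"
        f"Context: {context}\nQuestion: {query}\n"
        "Answer with exactly one word: Pass or Fail."
    )
    verdicts = []
    for _ in range(n_runs):
        resp = client.chat.completions.create(
            model="gpt-4o",  # illustrative model choice
            messages=[{"role": "user", "content": prompt}],
        )
        verdicts.append(resp.choices[0].message.content.strip())
    verdict, agreement = Counter(verdicts).most_common(1)[0]
    return verdict if agreement == n_runs else None  # None = flaky result, discard
```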
Yes, you can specify your own model for running evals. However, keep the following in mind:
- From our testing, evals work best with OpenAI models.
- If your evaluation task is complex, use a powerful model like `gpt-4o`. If it’s simple, use a smaller model like `gpt-3.5-turbo` or `llama-3-8b`.
Currently, we support all the major public models, as well as custom models.
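As a rough sketch of what this looks like in practice, using the `LlmEvaluator` snippet shown at the bottom of this page (the grading criteria strings here are made-up placeholders):

```python
from athina.evals import LlmEvaluator

# Complex, multi-condition grading criteria -> use a more capable model
complex_eval = LlmEvaluator(
    model="gpt-4o",
    grading_criteria=(
        "Response must only use facts from the provided context, "
        "follow the refund policy, and avoid speculation."
    ),
)

# Simple, single-condition check -> a smaller model is usually enough
simple_eval = LlmEvaluator(
    model="gpt-3.5-turbo",
    grading_criteria="Response must be written in English.",
)
```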
Evals run with your LLM keys. You can configure the keys and track the cost of evals on the settings page.
There are two controls that help you manage the cost of LLM evals in Athina.
- You can configure a maximum number of evals per month in the settings. For example, if your limit is set to 30k evals per month, we will sample roughly 1,000 logs per day.
- You can configure a sampling rate in the settings. For example, if your sampling rate is set to 10%, we will sample 10% of all logs (while still respecting the monthly limit).
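To make the interaction between the two controls concrete, here is an illustrative sketch of the sampling arithmetic (not Athina’s actual implementation):

```python
def logs_to_evaluate(daily_log_count: int,
                     max_evals_per_month: int = 30_000,
                     sampling_rate: float = 0.10) -> int:
    """Illustrative: number of logs evaluated per day given both cost controls."""
    daily_cap = max_evals_per_month // 30           # 30k/month -> ~1,000 evals/day
    sampled = int(daily_log_count * sampling_rate)  # e.g. 10% of today's logs
    return min(sampled, daily_cap)                  # sampling still respects the monthly cap

print(logs_to_evaluate(5_000))   # 500  -> under the daily cap
print(logs_to_evaluate(50_000))  # 1000 -> capped by the monthly limit
```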
Traditional evaluation metrics like BLEU and ROUGE have some value, but they also have major limitations:
- They require a reference to compare against: while you may have such ground-truth data in your development dataset, you will never have it in production (see the toy example below).
- They do not offer any reasoning capabilities: most developers are now using LLMs for far more complex use cases than traditional methods can evaluate.
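To illustrate, here is a toy example: a crude unigram-overlap score (a stand-in for BLEU/ROUGE-style metrics) needs a reference answer and still rewards a response that copies the reference’s wording while getting the key fact wrong:

```python
def unigram_f1(reference: str, response: str) -> float:
    """Crude word-overlap score, a stand-in for BLEU/ROUGE-style reference metrics."""
    ref, hyp = set(reference.lower().split()), set(response.lower().split())
    overlap = len(ref & hyp)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "The refund policy allows returns within 30 days of purchase."
wrong_answer = "The refund policy allows returns within 90 days of purchase."

print(unigram_f1(reference, wrong_answer))  # ~0.9: a high score despite the factual error
```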
In contrast, LLM evaluators:
- Can perform complex and nuanced evaluation tasks that require reasoning
- Come much closer to the accuracy of human-in-the-loop review
Intuitively, this makes sense. The best tool we have for handling reasoning tasks over large pieces of text is an LLM. So why would you use anything else for evals?
```python
from athina.evals import LlmEvaluator

# grading_criteria is your plain-text grading rubric; response is the output being evaluated
LlmEvaluator(model="gpt-4", grading_criteria=grading_criteria).run(response)
```
If you’d like to use a different model, contact us at hello@athina.ai.