Traditional evaluation metrics like BLEU and ROUGE have some value, but they also have major limitations:Documentation Index
Fetch the complete documentation index at: https://docs.athina.ai/llms.txt
Use this file to discover all available pages before exploring further.
- They require a reference to compare against: While you may have such ground truth data in your development dataset, you will never have this in production.
- Traditional metrics will not offer any reasoning capabilities Most developers are now using LLMs for much more complex use cases than can be evaluated by traditional methods.
- Can perform complex and nuanced tasks that include reasoning capabilities
- Come a lot closer to human-in-the-loop level of accuracy