Keep up with the latest and greatest research techniques to add more evaluation metrics / improve reliability
Cost Management
If you need to use an LLM for evaluation, it can get pretty expensive. Imagine running 5-10 evaluations per production log: the evaluation costs could be higher than the actual task costs!
Solution: Implement a sampling + cost tracking mechanism.
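As an illustration, here is a minimal Python sketch of that idea: a fixed sample rate decides which logs get LLM-based evals at all, and a small tracker accumulates the estimated spend so it can be compared against task spend. The prices, sample rate, and token counts are placeholder values, not Athina's.

```python
import random

# All prices and the sample rate below are placeholder values.
SAMPLE_RATE = 0.10            # evaluate ~10% of production logs
PRICE_PER_1K_INPUT = 0.005    # hypothetical $ per 1K prompt tokens
PRICE_PER_1K_OUTPUT = 0.015   # hypothetical $ per 1K completion tokens


def should_evaluate() -> bool:
    """Simple random sampling: only evaluate a fraction of logs."""
    return random.random() < SAMPLE_RATE


class EvalCostTracker:
    """Accumulates evaluation spend so it can be compared to task spend."""

    def __init__(self) -> None:
        self.total_cost = 0.0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.total_cost += (input_tokens / 1000) * PRICE_PER_1K_INPUT
        self.total_cost += (output_tokens / 1000) * PRICE_PER_1K_OUTPUT


if __name__ == "__main__":
    tracker = EvalCostTracker()
    # Simulated production logs; in practice these come from your log store.
    for _ in range(1000):
        if should_evaluate():
            # Pretend the LLM-based eval used 800 prompt / 200 completion tokens.
            tracker.record(input_tokens=800, output_tokens=200)
    print(f"Estimated eval spend for 1000 logs: ${tracker.total_cost:.2f}")
```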
Configurability
You need:
a library of evals
options to configure the Evals
options for creating custom evals
swappable models / providers
integrations with other eval libraries like Ragas, Guardrails, etc. (a minimal sketch of such a setup follows this list)
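To make that concrete, here is a small sketch (not Athina's actual API) of what a configurable eval library can look like: evals register themselves by name, each one takes a per-eval config with a swappable model, and custom evals plug in through the same mechanism. All names and fields are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class EvalConfig:
    """Per-eval configuration; `model` is swappable per provider."""
    name: str
    model: str            # e.g. "gpt-4o" or "gemini-1.5-pro"
    threshold: float = 0.5


EVAL_REGISTRY: Dict[str, Callable[[dict, EvalConfig], float]] = {}


def register_eval(name: str):
    """Decorator that adds built-in or custom evals to the library."""
    def wrapper(fn):
        EVAL_REGISTRY[name] = fn
        return fn
    return wrapper


@register_eval("answer_length")
def answer_length_eval(log: dict, config: EvalConfig) -> float:
    """A trivial heuristic eval; an LLM-based eval would call config.model here."""
    return 1.0 if log.get("response") else 0.0


# A custom eval can be registered the same way by application teams.
@register_eval("contains_citation")
def contains_citation_eval(log: dict, config: EvalConfig) -> float:
    return 1.0 if "[" in log.get("response", "") else 0.0


def run_configured_evals(log: dict, configs: list[EvalConfig]) -> dict:
    """Run only the evals selected in configuration for this log."""
    return {c.name: EVAL_REGISTRY[c.name](log, c) for c in configs}


# Usage
configs = [EvalConfig("answer_length", model="gpt-4o"),
           EvalConfig("contains_citation", model="gemini-1.5-pro")]
print(run_configured_evals({"response": "See [1] for details."}, configs))
```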
Automation
Needless to say, running evals in production should be automated and continuous. That poses a number of challenges at scale. This means:
You need to scale your evaluation infrastructure to meet your logging throughput
You need a way to configure evals and store configuration
You need a way to select which evals should run on which prompts
You need mechanisms to handle rate limiting
You need evals to run with swappable models / providers
You need a way to run a newly configured evaluation against old logs
Solution: Build an orchestration layer for evaluation.
Athina's eval orchestration layer manages eval configurations, sampling, filtering, deduping, rate limiting, switching between model providers, alerting, and granular analytics to provide a complete evaluation platform. You can run evals during development, in CI / CD, as real-time guardrails, or continuously in production.
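The sketch below shows, under assumed names and a made-up config schema, what a small slice of such an orchestration layer might look like: a stored configuration selects which evals run on which prompts and at what sample rate, and a simple rate limiter spaces out calls to the eval model provider.

```python
import random
import time

# Illustrative config schema: prompt slug -> evals to run, with sample rates.
EVAL_CONFIG = {
    "customer_support_prompt": [
        {"eval": "faithfulness", "sample_rate": 0.2},
        {"eval": "pii_leakage", "sample_rate": 1.0},
    ],
    "summarization_prompt": [
        {"eval": "context_relevance", "sample_rate": 0.1},
    ],
}


class RateLimiter:
    """Allows at most `max_calls` eval requests per `period` seconds."""

    def __init__(self, max_calls: int, period: float = 60.0) -> None:
        self.max_calls = max_calls
        self.period = period
        self.calls: list[float] = []

    def wait(self) -> None:
        now = time.monotonic()
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) >= self.max_calls:
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())


def evals_for(prompt_slug: str) -> list[dict]:
    """Look up which evals are configured for a given prompt."""
    return EVAL_CONFIG.get(prompt_slug, [])


def orchestrate(log: dict, limiter: RateLimiter) -> dict:
    """Run the configured, sampled evals for one log, respecting rate limits."""
    results = {}
    for item in evals_for(log["prompt_slug"]):
        if random.random() > item["sample_rate"]:
            continue  # skipped by sampling
        limiter.wait()
        results[item["eval"]] = 1.0  # placeholder: call the real eval here
    return results


limiter = RateLimiter(max_calls=30, period=60.0)
print(orchestrate({"prompt_slug": "customer_support_prompt"}, limiter))
```

The same loop can be pointed at historical logs to backfill a newly configured eval, since selection and sampling are driven entirely by the stored configuration.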
Support for different models, architectures, and traces
Say your team wants to switch from OpenAI to Gemini. Suppose you add a new step to your LLM pipeline. Maybe you're building an agent and need to support complex traces? Maybe you switched from Langchain to Llama Index? Maybe you're building a chat application and need special evals for that? Can your logging and evaluation infrastructure support this?
Solution: You need a normalization layer that is separate from your evaluation infrastructure.
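Here is a rough sketch of what such a normalization layer can look like: provider-specific payloads (OpenAI-style and Gemini-style in this example) are mapped into one neutral trace schema before evals ever see them. The schema and field names are illustrative assumptions, not a prescribed format.

```python
from dataclasses import dataclass, field


@dataclass
class NormalizedSpan:
    role: str                 # "user", "assistant", "retrieval", "tool", ...
    content: str
    metadata: dict = field(default_factory=dict)


@dataclass
class NormalizedTrace:
    trace_id: str
    spans: list[NormalizedSpan]
    model: str


def from_openai_chat(trace_id: str, response: dict) -> NormalizedTrace:
    """Map an OpenAI-style chat completion payload into the neutral schema."""
    spans = [NormalizedSpan(role=c["message"]["role"],
                            content=c["message"]["content"])
             for c in response.get("choices", [])]
    return NormalizedTrace(trace_id, spans, response.get("model", "unknown"))


def from_gemini(trace_id: str, response: dict) -> NormalizedTrace:
    """Map a Gemini-style response payload into the same schema."""
    spans = [NormalizedSpan(role="assistant", content=part.get("text", ""))
             for cand in response.get("candidates", [])
             for part in cand.get("content", {}).get("parts", [])]
    return NormalizedTrace(trace_id, spans, response.get("modelVersion", "unknown"))


# Evals only ever consume NormalizedTrace, so swapping providers or adding a
# pipeline step means writing one new adapter, not touching every eval.
trace = from_openai_chat("t1", {"model": "gpt-4o", "choices": [
    {"message": {"role": "assistant", "content": "Hello!"}}]})
print(trace)
```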
Inspect and debug complex traces and chats
Interpretation & Analytics
What do you do with the eval metrics that were calculated? Ideally, you want to be able to:
Measure overall app performance.
Measure retrieval quality.
Measure usage metrics such as token counts, cost, and response times.
Measure safety issues like PII leakage or prompt injection attacks.
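For illustration, here is a small sketch of how per-log eval results might be rolled up into those aggregate metrics. The input rows and field names are hypothetical; in practice they would come from wherever your eval results are stored.

```python
from statistics import mean

# Hypothetical per-log eval results.
eval_results = [
    {"eval": "faithfulness", "score": 0.9, "tokens": 1200, "cost": 0.004, "latency_ms": 850},
    {"eval": "context_relevance", "score": 0.7, "tokens": 900, "cost": 0.003, "latency_ms": 620},
    {"eval": "pii_leakage", "score": 1.0, "tokens": 400, "cost": 0.001, "latency_ms": 300},
]


def summarize(rows: list[dict]) -> dict:
    """Aggregate eval scores and usage stats into dashboard-style metrics."""
    return {
        "avg_score_by_eval": {
            name: mean(r["score"] for r in rows if r["eval"] == name)
            for name in {r["eval"] for r in rows}
        },
        "total_tokens": sum(r["tokens"] for r in rows),
        "total_cost": sum(r["cost"] for r in rows),
        "avg_latency_ms": mean(r["latency_ms"] for r in rows),
        # Safety checks like PII leakage can be flagged when the score drops.
        "pii_alerts": sum(1 for r in rows if r["eval"] == "pii_leakage" and r["score"] < 1.0),
    }


print(summarize(eval_results))
```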
We spent a lot of time working through these problems so you don't need a dedicated team for this. You can see a demo video here.
Website: Athina AI (try our sandbox).
Sign up for Athina.
GitHub: Run any of our 40+ open source evaluations using our Python SDK to measure your LLM app.