These evals are very useful for most RAG-style applications. They check for three things:
- Context Contains Enough Information: Does the retrieved context contain enough information to answer the query?
- Faithfulness: Is the response faithful to the context? (Unfaithful responses are correlated with hallucinations.)
- Does Response Answer Query: Does the response answer the user’s query? Checks for relevance and answer completeness.
Context Contains Enough Information
One of the most common causes of a bad output is a bad input. For RAG applications, this usually means a bad retrieval. Typically, retrieval is a cosine similarity search against the user’s query. However, similar ≠ relevant. Often the retrieved data is not relevant to the user’s query. Sometimes it is relevant but does not contain the answer. We use an LLM grader (GPT-4) to determine whether the retrieved data is relevant and has enough information to answer the query.
- Query: How much equity does Y Combinator take?
- Retrieved Context: YC invests $500,000 in 200 startups twice a year.
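The grader step can be sketched roughly as follows. This is a minimal illustration, not Athina's implementation: the prompt text, the `context_has_enough_info` function, and the `call_llm` hook are all assumptions made for the example. `call_llm` stands in for whatever client sends a prompt to the grader model and returns its raw text reply.

```python
import json
from typing import Callable

# Hypothetical grader prompt, written for this sketch only.
PROMPT = """You are grading retrieval quality.
Query: {query}
Retrieved context: {context}
Does the context contain enough information to answer the query?
Reply with JSON: {{"result": "Pass" or "Fail", "explanation": "..."}}"""


def context_has_enough_info(
    query: str, context: str, call_llm: Callable[[str], str]
) -> dict:
    """Ask an LLM grader whether `context` is sufficient to answer `query`.

    `call_llm` is a placeholder: it takes a prompt string and returns the
    grader model's text reply.
    """
    raw = call_llm(PROMPT.format(query=query, context=context))
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        verdict = None
    if not isinstance(verdict, dict):
        # Treat unparseable grader output as a failure rather than guessing.
        return {"passed": False, "explanation": f"Unparseable grader reply: {raw!r}"}
    return {
        "passed": verdict.get("result") == "Pass",
        "explanation": verdict.get("explanation", ""),
    }
```

Injecting `call_llm` keeps the sketch independent of any particular model client and makes the parsing logic easy to test with a stub. For the example above, a reasonable grader would be expected to return a failing verdict, since the context gives the investment size but no equity figure.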
Faithfulness
Another common problem with RAG applications is a response that is not “faithful” to the context. This is often the cause of “hallucinations”: the LLM might use its pretrained knowledge to generate an answer. For most RAG apps, however, you want to constrain the response to the context you provide (since you know it to be true).
- Retrieved Context: YC invests $500,000 in 200 startups twice a year.
- Response: YC takes 5-7% equity.
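One way to make a faithfulness check more actionable is to grade the response sentence by sentence and report which sentences the context does not support. The sketch below assumes a hypothetical `is_supported(context, sentence)` hook that would wrap an LLM grader call; the function names and the naive sentence splitter are illustrative choices, not Athina's method.

```python
import re
from typing import Callable, List


def unsupported_sentences(
    context: str, response: str, is_supported: Callable[[str, str], bool]
) -> List[str]:
    """Return the response sentences the grader could not ground in the context.

    `is_supported` is a hypothetical hook: in practice it would call an LLM
    grader and return True when the context backs the sentence.
    """
    # Naive split after ., ! or ? followed by whitespace. A real sentence
    # splitter would also handle abbreviations and similar edge cases.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    return [s for s in sentences if not is_supported(context, s)]


def is_faithful(
    context: str, response: str, is_supported: Callable[[str, str], bool]
) -> bool:
    """A response counts as faithful only if every sentence is supported."""
    return not unsupported_sentences(context, response, is_supported)
```

Returning the unsupported sentences, and not only a pass/fail flag, shows where the model likely fell back on pretrained knowledge.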
Answer Completeness
This is a good eval for nearly any Q&A-type application. It can help you check whether:
- Response is irrelevant or tangential to the query.
- Response does not sufficiently answer the query.
Example:
- Query: Which spaceship landed on the moon first?
- Response: Neil Armstrong was the first man to set foot on the moon in 1969.
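In practice this kind of check is usually run over many query/response pairs and summarized as a pass rate. The following is a small sketch of that loop under stated assumptions: `answers_query(query, response)` is a hypothetical hook that would wrap an LLM grader, and `EvalResult`, `run_answer_eval`, and `pass_rate` are names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class EvalResult:
    query: str
    response: str
    passed: bool


def run_answer_eval(
    pairs: Iterable[Tuple[str, str]],
    answers_query: Callable[[str, str], bool],
) -> List[EvalResult]:
    """Grade each (query, response) pair with the supplied grader hook.

    `answers_query` is a placeholder: it should return True when the
    response is relevant to the query and answers it sufficiently.
    """
    return [EvalResult(q, r, bool(answers_query(q, r))) for q, r in pairs]


def pass_rate(results: List[EvalResult]) -> float:
    """Fraction of pairs that passed; 0.0 for an empty batch."""
    return sum(r.passed for r in results) / len(results) if results else 0.0
```

Keeping the per-pair results alongside the aggregate makes it possible to inspect the individual failures, not only the overall score.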