RAG Evaluators

These evals are useful for most RAG-style applications. They check for three things:

Context Contains Enough Information: Does the retrieved context contain enough information to answer the query?
Faithfulness: Is the response faithful to the context? (Unfaithful responses are correlated with hallucinations.)
Does Response Answer Query: Does the response answer the user’s query? Checks for relevance and answer completeness.

Context Contains Enough Information

Query: How much equity does Y Combinator take?

Retrieved Context: YC invests $500,000 in 200 startups twice a year.

Eval Result

Result: Fail
Explanation: The context mentions that YC invests $500,000 but it does not mention how much equity they take, which is what the query is asking about.

One of the most common causes of bad output is bad input. For RAG applications, this usually means bad retrieval. A typical retrieval step runs a cosine similarity search against the user’s query. However, similar ≠ relevant. Often the retrieved data is not relevant to the query at all. Sometimes it is relevant but still does not contain the answer. We use an LLM grader (GPT-4) to determine whether the retrieved data is relevant and has enough information to answer the query.
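The grading step described above can be sketched roughly as follows. This is a minimal illustration, not the evaluator's actual implementation: the prompt wording, the JSON reply schema, and the function names are all assumptions, and `llm` is a hypothetical callable that sends a prompt to whichever grader model you use and returns its text reply.

```python
import json
from typing import Callable, Dict

# Hypothetical grading prompt; the wording and the JSON schema are assumptions.
PROMPT_TEMPLATE = (
    "You are grading the retrieval step of a RAG application.\n"
    "Decide whether the context contains enough information to answer the query.\n"
    'Reply with JSON only: {{"result": "Pass" or "Fail", "explanation": "..."}}\n\n'
    "Query: {query}\n"
    "Context: {context}\n"
)


def parse_verdict(raw: str) -> Dict[str, str]:
    """Parse the grader's JSON reply and normalize the result to Pass or Fail."""
    data = json.loads(raw)
    result = str(data["result"]).strip().capitalize()
    if result not in ("Pass", "Fail"):
        raise ValueError(f"Unexpected result value: {data['result']!r}")
    return {"result": result, "explanation": str(data.get("explanation", ""))}


def context_contains_enough_information(
    query: str, context: str, llm: Callable[[str], str]
) -> Dict[str, str]:
    """Ask the grader whether the retrieved context is sufficient for the query."""
    prompt = PROMPT_TEMPLATE.format(query=query, context=context)
    return parse_verdict(llm(prompt))
```

Passing the grader in as a plain callable keeps the sketch independent of any particular model client, and lets you substitute a stub when testing.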

Faithfulness

Retrieved Context: YC invests $500,000 in 200 startups twice a year.

Response: YC takes 5-7% equity.

Eval Result

Result: Fail
Explanation: The response mentions that YC takes 5-7% equity, but this is not mentioned anywhere in the context.

Another common problem with RAG applications is a response that is not “faithful” to the context. This is often the cause of hallucinations: the LLM might use its pretrained knowledge to generate an answer. For most RAG apps, though, you want to constrain the response to the context you provide, since you know that context to be true.
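One way to picture a faithfulness check is to split the response into sentences and ask, for each one, whether the context supports it. The sketch below takes that approach as an illustration only; it is not necessarily how this evaluator works. The function names are made up, and `is_supported` is a hypothetical callable (in practice an LLM grader) that returns True when a claim is backed by the context.

```python
import re
from typing import Callable, Dict, List, Union


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: breaks after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def check_faithfulness(
    response: str,
    context: str,
    is_supported: Callable[[str, str], bool],
) -> Dict[str, Union[str, List[str]]]:
    """Fail if any sentence in the response is not supported by the context."""
    unsupported = [
        sentence
        for sentence in split_sentences(response)
        if not is_supported(sentence, context)
    ]
    return {
        "result": "Fail" if unsupported else "Pass",
        "unsupported_claims": unsupported,
    }
```

Returning the unsupported sentences alongside the verdict makes it easier to see which part of the response went beyond the context.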

Answer Completeness

Query: Which spaceship landed on the moon first?

Response: Neil Armstrong was the first man to set foot on the moon in 1969.

Eval Result

Result: Fail
Explanation: The query is asking which spaceship landed on the moon first, but the response only mentions the name of the astronaut, and does not say anything about the name of the spaceship.

This is a good eval for nearly any Q&A-type application. It can help you check whether:

The response is irrelevant or tangential to the query.
The response does not sufficiently answer the query.
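The two failure modes above can be graded separately and then combined into one verdict. The following is a rough sketch under stated assumptions: the prompt text, the `relevant` and `complete` fields, and the function name are hypothetical, and `llm` is a placeholder callable that returns the grader model's text reply.

```python
import json
from typing import Callable, Dict


def does_response_answer_query(
    query: str, response: str, llm: Callable[[str], str]
) -> Dict[str, str]:
    """Pass only if the grader judges the response both relevant and complete."""
    # Hypothetical prompt and reply schema, for illustration only.
    prompt = (
        "Judge whether the response answers the query.\n"
        "Reply with JSON only: "
        '{"relevant": true or false, "complete": true or false, "explanation": "..."}\n\n'
        f"Query: {query}\n"
        f"Response: {response}\n"
    )
    verdict = json.loads(llm(prompt))
    passed = bool(verdict["relevant"]) and bool(verdict["complete"])
    return {
        "result": "Pass" if passed else "Fail",
        "explanation": str(verdict.get("explanation", "")),
    }
```

Keeping relevance and completeness as separate fields lets you tell a tangential response apart from one that is on topic but leaves part of the question unanswered.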
