Evals
Which evaluations should you use for RAG applications?
See this post for a step-by-step guide and video on how to use Athina IDE to measure retrieval accuracy in RAG applications: Measure Retrieval Accuracy Using Athina IDE
Common Failures in RAG-based LLM apps
RAG-based LLM apps are powerful, but they usually have plenty of kinks to iron out.
Here are some common ones:
Bad retrieval
- Retrieved context is not aligned with the ground truth (Context Recall)
- The relevant chunks are retrieved, but ranked low (Context Precision), as sketched after this list
- Retrieved context doesn't contain enough information to answer the query (Context Sufficiency)
- Retrieved context is not relevant to the query (Context Relevancy)
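To make the first two metrics concrete, here is a deliberately simplified sketch. The function names are hypothetical, and real evaluators (such as the Ragas-based presets) use an LLM judge rather than exact substring matching, but the scoring logic is the same idea:

```python
# Simplified, hypothetical versions of Context Recall and Context Precision.
# Production evaluators typically use an LLM judge instead of substring
# matching; this sketch just shows what each metric measures.

def context_recall(retrieved_chunks: list[str], ground_truth_facts: list[str]) -> float:
    """Fraction of ground-truth facts found somewhere in the retrieved chunks."""
    if not ground_truth_facts:
        return 1.0
    hits = sum(
        any(fact in chunk for chunk in retrieved_chunks)
        for fact in ground_truth_facts
    )
    return hits / len(ground_truth_facts)

def context_precision(relevance_by_rank: list[bool]) -> float:
    """Rank-aware precision: averages precision@k at each rank k that holds
    a relevant chunk, so relevant chunks ranked low pull the score down."""
    precisions, relevant_seen = [], 0
    for k, is_relevant in enumerate(relevance_by_rank, start=1):
        if is_relevant:
            relevant_seen += 1
            precisions.append(relevant_seen / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# A relevant chunk at rank 1 scores higher than the same chunk at rank 3:
print(context_precision([True, False, False]))   # 1.0
print(context_precision([False, False, True]))   # ~0.33
```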
Bad outputs
- Response says something that cannot be inferred from the context (Faithfulness), as sketched after this list
- Response has many sentences that are not grounded in the context (Groundedness)
- Conversation / chat has messages that are not coherent given the previous messages (Conversation Coherence)
- Some other criteria… (Custom Evaluation)
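Most of these output checks follow the same LLM-as-judge pattern. Below is a minimal sketch of a faithfulness-style check; the prompt, model choice, and yes/no parsing are illustrative assumptions, not Athina's actual implementation:

```python
# Minimal LLM-as-judge sketch for a faithfulness check. The prompt, model,
# and yes/no parsing are illustrative assumptions, not Athina's actual eval.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def is_faithful(context: str, response: str) -> bool:
    """Ask an LLM judge whether the response is inferable from the context."""
    prompt = (
        f"Context:\n{context}\n\n"
        f"Response:\n{response}\n\n"
        "Can every claim in the response be inferred from the context alone? "
        "Answer with exactly 'Yes' or 'No'."
    )
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # any judge model works; this choice is an assumption
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return verdict.choices[0].message.content.strip().lower().startswith("yes")

print(is_faithful(
    context="The Eiffel Tower is 330 metres tall and located in Paris.",
    response="The Eiffel Tower is in Berlin.",
))  # expected: False
```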
How to detect such issues
Just plug in the evaluators you need and run them on your dataset.
You can run these evaluations in a Python notebook and view the results in a dataframe: Example Notebook on GitHub
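For example, here is a minimal sketch of that workflow using the athina-evals SDK. The class and loader names follow the SDK's documented run_batch → to_df pattern but may differ across versions, so treat them as assumptions and check the linked notebook for the exact API:

```python
# Sketch of running a preset eval in a notebook with the athina-evals SDK.
# Class and loader names follow the SDK's documented pattern but may differ
# across versions -- treat them as assumptions and check the SDK docs.
import os
from athina.evals import RagasContextRecall
from athina.loaders import Loader
from athina.keys import AthinaApiKey, OpenAiApiKey

OpenAiApiKey.set_key(os.environ["OPENAI_API_KEY"])
AthinaApiKey.set_key(os.environ["ATHINA_API_KEY"])  # optional: logs results to Athina

# Each row holds the query, the retrieved context, and the model's response.
raw_data = [
    {
        "query": "How tall is the Eiffel Tower?",
        "context": ["The Eiffel Tower is 330 metres tall."],
        "response": "The Eiffel Tower is 330 metres tall.",
        "expected_response": "330 metres",
    }
]
dataset = Loader().load_dict(raw_data)

# Run one evaluator over the whole dataset and view results as a dataframe.
df = RagasContextRecall().run_batch(data=dataset).to_df()
df.head()
```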