See this post for a step-by-step guide and video on how to use Athina IDE to measure retrieval accuracy in RAG applications: Measure Retrieval Accuracy Using Athina IDE

Common Failures in RAG-based LLM apps

RAG-based LLM apps are powerful, but they rarely work well out of the box; there are usually plenty of kinks to iron out.

Here are some common ones:

Bad retrieval: the retrieved context does not contain the information needed to answer the query.

Bad outputs: the response is incorrect, hallucinated, or does not fully answer the query, even when the retrieved context is good. (Both failure modes are illustrated in the sketch below.)
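
For example, here is roughly what the two failure modes look like on a single datapoint. This is an illustrative sketch only: the query, context, and response values are made up, and the dict keys simply mirror the query/context/response format used by the loader in the code further down.

# Illustrative only: hypothetical datapoints showing the two failure modes.

# Bad retrieval: the retrieved context is unrelated to the query,
# so even a perfect LLM cannot produce a correct, grounded answer from it.
bad_retrieval_example = {
    "query": "How much does YC invest in startups?",
    "context": ["YC was founded in 2005."],
    "response": "I don't have enough information to answer that.",
}

# Bad output: the retrieved context contains the right information,
# but the response contradicts it (a hallucination).
bad_output_example = {
    "query": "How much does YC invest in startups?",
    "context": ["YC invests $500,000 in startups on standard terms."],
    "response": "YC invests $5M in every startup it accepts.",
}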

How to detect such issues

Just plug in the evaluators you need and run them against your dataset:

import os
import pandas as pd
from dotenv import load_dotenv

from athina import evals
from athina.loaders import Loader
from athina.keys import OpenAiApiKey
from athina.runner.run import EvalRunner
from athina.datasets import yc_query_mini

# Load environment variables and set the OpenAI API key used by the LLM-graded evals
load_dotenv()
OpenAiApiKey.set_key(os.getenv('OPENAI_API_KEY'))

# Load a sample dataset from a list of dicts
raw_data = yc_query_mini.data
dataset = Loader().load_dict(raw_data)

# Preview the dataset as a dataframe (displays in a notebook)
pd.DataFrame(dataset)

# Define the evaluation suite
model = "gpt-4-turbo-preview"  # LLM used to grade the evals
eval_suite = [
    # Answer correctness: compares the response to the expected answer
    evals.RagasAnswerCorrectness(model=model),
    # Retrieval quality: did we fetch the right context?
    evals.RagasContextPrecision(model=model),
    evals.RagasContextRelevancy(model=model),
    evals.RagasContextRecall(model=model),
    evals.ContextContainsEnoughInformation(model=model),
    # Faithfulness: is the response grounded in the retrieved context?
    evals.RagasFaithfulness(model=model),
    evals.Faithfulness(model=model),
    evals.Groundedness(model=model),
    # Does the response actually answer the query?
    evals.DoesResponseAnswerQuery(model=model),
]

# Run the evaluation suite (evals run in parallel)
batch_eval_result = EvalRunner.run_suite(
    evals=eval_suite,
    data=dataset,
    max_parallel_evals=8
)

# View the results (displays in a notebook)
batch_eval_result

You can run these evaluations in a Python notebook and view the results in a dataframe, as in this example: Example Notebook on GitHub
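
If you just want to spot-check a single datapoint while debugging, you can also run an individual evaluator on its own. The snippet below is a minimal sketch: the .run() call and its keyword arguments are assumptions based on the batch interface above, so check the athina-evals docs for the exact signature.

# Minimal sketch: run one evaluator on a single datapoint.
# Assumption: evaluators expose a .run() method that accepts the datapoint
# fields as keyword arguments; verify against the athina-evals documentation.
from athina import evals

result = evals.DoesResponseAnswerQuery(model="gpt-4-turbo-preview").run(
    query="How much does YC invest in startups?",
    response="YC invests $500,000 in startups on standard terms.",
)
print(result)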