How to Measure Retrieval Accuracy in RAG Applications Using Athina IDE
While evaluating the accuracy of the Large Language Model (LLM) response is crucial, it is equally important to measure the accuracy of the retrieval step separately.
This helps in identifying how effective the retrieval step is in providing relevant documents to the LLM for generating a response.
What You Will Learn in this Guide
In this post, we’ll walk you through:
- Setting up a basic RAG application using Langchain and Chroma
- Loading a dataset into Athina
- Evaluating retrieval accuracy using various metrics
- Leveraging dynamic columns in Athina IDE
- Exploring further steps to enhance your RAG application
Video: How to Measure Retrieval Accuracy in RAG Applications
Set up a RAG application with Langchain + Chroma
We’ll start by setting up a simple RAG application using Langchain and Chroma.
For this example, we will use the first Harry Potter book, “The Sorcerer’s Stone”, and chunk it into segments of 512 characters with a 20-character overlap.
Install the Required Dependencies
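The exact dependency set depends on your environment; the package names below are assumptions based on the stack named in this post (Langchain, Chroma, and OpenAI embeddings), so adjust them to match your setup:

```shell
pip install langchain langchain-community langchain-openai chromadb openai
```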
Import the Required Dependencies and Configure API Keys
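A sketch of the configuration step, assuming OpenAI embeddings (the environment variable name is OpenAI's standard one). The commented imports reflect typical module paths for this stack, which vary by Langchain version:

```python
import os

# Set the key used by the embedding model (assumes OpenAI embeddings).
# The value below is a placeholder; use your real key.
os.environ.setdefault("OPENAI_API_KEY", "sk-placeholder")

# Typical imports for this stack (exact module paths vary by Langchain version):
#   from langchain_openai import OpenAIEmbeddings
#   from langchain_community.vectorstores import Chroma
#   from langchain_text_splitters import RecursiveCharacterTextSplitter
```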
Load the Data and Chunk the Document
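In Langchain this step is typically a document loader plus a text splitter (e.g. `RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=20)`). As a minimal sketch of what fixed-size chunking with overlap does, in plain Python:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 20) -> list[str]:
    """Split text into fixed-size chunks where neighbors share `overlap` characters."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap  # advance by chunk_size minus the shared tail
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# With Langchain, the equivalent is roughly:
#   from langchain_text_splitters import RecursiveCharacterTextSplitter
#   splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=20)

chunks = chunk_text("a" * 1100)
print(len(chunks), len(chunks[0]))  # 3 512
```

Note that Langchain's splitter additionally prefers natural boundaries (paragraphs, sentences) rather than cutting at exact character offsets.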
Create your test dataset
Retrieve Relevant Documents for a Set of Queries
Let’s create a test dataset of queries and retrieve the relevant documents for each query from the vector store.
Here, we are using similarity search to retrieve the most similar documents for each query.
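Under the hood, similarity search scores each stored chunk's embedding against the query embedding and returns the top matches; in Langchain this is `vectorstore.similarity_search(query, k=...)`, with Chroma doing the indexing at scale. A toy sketch of the idea using cosine similarity over made-up 3-dimensional vectors:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query vector."""
    ranked = sorted(doc_vecs, key=lambda d: cosine_similarity(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]

# Toy embeddings -- real ones come from an embedding model (e.g. OpenAI's).
docs = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.05, 0.0], docs, k=2))  # ['chunk_a', 'chunk_b']
```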
Loading the Dataset into Athina
Now, let’s load our dataset into Athina.
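Athina IDE works with datasets whose rows pair each query with its retrieved context. The field names and file format below are illustrative assumptions; match them to your Athina dataset schema and preferred upload path:

```python
import json

# Each row pairs a query with the retrieved context chunks.
# Field names here are illustrative; align them with your Athina dataset schema.
rows = [
    {
        "query": "Who gave Harry his first broomstick?",
        "context": ["...retrieved chunk 1...", "...retrieved chunk 2..."],
    },
]

with open("retrieval_dataset.json", "w") as f:
    json.dump(rows, f, indent=2)
```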
Evaluating the Retrieval
Now that we have our dataset loaded into Athina, we can evaluate the retrieval accuracy using various metrics.
Choosing the Evaluation Metrics
To measure the retrieval accuracy for this dataset, we used an LLM-as-a-judge evaluator with a custom prompt.
Here is the prompt that was used:
This metric evaluates if the retrieved documents have enough information to answer the query.
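An LLM-as-a-judge prompt for this kind of check typically interpolates the query and the retrieved context, then asks for a pass/fail verdict. The template below is a hypothetical illustration, not necessarily the exact prompt used:

```python
# Hypothetical judge prompt -- illustrative only, not the exact prompt used in Athina.
JUDGE_PROMPT = """You are evaluating a retrieval system.

Query:
{query}

Retrieved context:
{context}

Does the retrieved context contain enough information to fully answer the query?
Answer with "Pass" or "Fail", followed by a one-sentence justification."""

def build_judge_prompt(query: str, context: str) -> str:
    """Fill the template with one dataset row before sending it to the judge LLM."""
    return JUDGE_PROMPT.format(query=query, context=context)

prompt = build_judge_prompt("Who is Harry's best friend?", "Ron Weasley sat beside Harry...")
```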
Running an evaluation in Athina IDE
Watch the video below to learn how to run evaluations in Athina IDE (without writing any code).
Running Experiments in Athina IDE with Dynamic Columns
Dynamic columns in Athina IDE allow you to run more experiments on the dataset.
For example, you can:
- Extract individual documents from the context (using the Code Execution dynamic column) and evaluate the accuracy of the 1st, 2nd, and 3rd chunks separately to see how good the ranking is.
- Generate a summary of the retrieved documents and evaluate its accuracy (using the Run Prompt dynamic column).
- Generate LLM responses (using the Run Prompt dynamic column).
- Try rephrasing the query and evaluate the retrieval accuracy.
- Compare this dataset with another dataset in Athina IDE to see the responses and eval scores side-by-side. This can be useful if you are trying multiple different retrieval strategies and want to compare their performance.
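As an example of the first idea, a Code Execution column might split the stored context back into individual chunks so that each rank can be evaluated on its own. A sketch, assuming the chunks were joined with a known separator (match it to however your pipeline stores context):

```python
SEPARATOR = "\n\n---\n\n"  # assumed join separator; adjust to your pipeline's format

def nth_chunk(context: str, n: int) -> str:
    """Return the n-th retrieved chunk (1-indexed), or '' if fewer were retrieved."""
    parts = context.split(SEPARATOR)
    return parts[n - 1] if n <= len(parts) else ""

ctx = SEPARATOR.join(["first chunk", "second chunk", "third chunk"])
print(nth_chunk(ctx, 2))  # second chunk
```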
By leveraging these advanced features, you can continuously refine and improve the retrieval accuracy of your RAG applications.
You can book a call with us to learn more about how Athina can help your team build AI applications faster.