While evaluating the accuracy of the Large Language Model (LLM) response is crucial, it is equally important to measure the accuracy of the retrieval step separately.

This helps you identify how effectively the retrieval step provides relevant documents to the LLM for generating a response.

What You Will Learn in this Guide

In this post, we’ll walk you through:

  • Setting up a basic RAG application using LangChain and Chroma
  • Loading a dataset into Athina
  • Evaluating retrieval accuracy using various metrics
  • Leveraging dynamic columns in Athina IDE
  • Exploring further steps to enhance your RAG application

Video: How to Measure Retrieval Accuracy in RAG Applications

Set up a RAG application with LangChain + Chroma

We’ll start by setting up a simple RAG application using LangChain and Chroma.

For this example, we will use the first Harry Potter book, “The Sorcerer’s Stone”, and chunk it into segments of 512 characters with a 20-character overlap.

Install the Required Dependencies

pip install athina-client chromadb langchain langchain-openai langchain-community langchain-chroma

Import the Required Dependencies and Configure API Keys

import os
import pandas as pd
from dotenv import load_dotenv
from athina_client.keys import AthinaApiKey
from athina_client.datasets import Dataset
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load API keys
load_dotenv()
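# load_dotenv() also loads OPENAI_API_KEY from your .env file (if present),
# which OpenAIEmbeddings reads from the environment later on.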

# You can get an Athina API key by signing up at https://app.athina.ai
AthinaApiKey.set_key(os.getenv('ATHINA_API_KEY'))

Load the Data and Chunk the Document

# Load the data
loader = TextLoader('data/harry_potter_sorcerers_stone.txt')
data = loader.load()

# Split the data into chunks of 512 characters with a 20-character overlap
chunk_size = 512
chunk_overlap = 20
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
all_splits = text_splitter.split_documents(data)

# Drop any empty chunks before indexing
valid_documents = [doc for doc in all_splits if doc.page_content.strip()]

# Store splits in vector store
vectorstore = Chroma.from_documents(documents=valid_documents, embedding=OpenAIEmbeddings())
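
Before building the test dataset, you can sanity-check the index with a quick similarity search. Here is a minimal sketch (the query below is only an illustrative example):

# Quick sanity check: fetch the single most similar chunk for a sample query
sample_query = "Who are the Dursleys?"  # illustrative query, not part of the dataset below
top_match = vectorstore.similarity_search(sample_query, k=1)
print(top_match[0].page_content[:200])  # first 200 characters of the top-ranked chunk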

Create your test dataset

Retrieve Relevant Documents for a Set of Queries

Let’s create a dataset with a set of queries and retrieve relevant documents from the vector store.

Here, we are using similarity search to retrieve the most similar documents for each query.

dataset_rows = []
queries = [
  "What is the name of Harry Potter's aunt?",
  "What is the address of the Dursley house?",
  # Add more questions here
]

for query in queries:
    # Retrieve relevant documents from vector store
    relevant_documents = vectorstore.similarity_search(query, k=4)
    dataset_rows.append({
        "query": query,
        "context": [page.page_content for page in relevant_documents]
    })

# Print the dataframe to see the dataset
pd.DataFrame(dataset_rows)

Loading the Dataset into Athina

Now, let’s load our dataset into Athina.

# Create dataset using Athina SDK
# Pick any name and description for your dataset
dataset_name = "harry-potter-retrieval-eval"
dataset_description = "Retrieved context chunks for a set of Harry Potter test queries"

dataset = Dataset.create(
    name=dataset_name,
    description=dataset_description,
    rows=dataset_rows
)

print(f"https://app.athina.ai/develop/{dataset.id}")

Evaluating the Retrieval

Now that we have our dataset loaded into Athina, we can evaluate the retrieval accuracy using various metrics.

Choosing the Evaluation Metrics

To measure the retrieval accuracy for this dataset, we used an LLM-as-a-judge evaluation with a Custom Prompt.

Here is the prompt that was used:

You are an expert at evaluation.

Determine if the context provided contains enough information to answer the given query.

### QUERY
{{query}}

### CONTEXT
{{context}}

This metric evaluates if the retrieved documents have enough information to answer the query.
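
As a local illustration of the same idea (not Athina's implementation), a minimal LLM-as-a-judge sketch using the OpenAI Python client might look like the following; the judge model and the Pass/Fail instruction are assumptions:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_retrieval(query, context_chunks):
    # Fill the judge prompt with the query and the retrieved chunks
    joined_context = "\n\n".join(context_chunks)
    prompt = (
        "You are an expert at evaluation.\n\n"
        "Determine if the context provided contains enough information "
        "to answer the given query. Answer 'Pass' or 'Fail' with a one-line reason.\n\n"
        f"### QUERY\n{query}\n\n"
        f"### CONTEXT\n{joined_context}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; use whichever model you prefer
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example: judge the first row of the dataset built earlier
print(judge_retrieval(dataset_rows[0]["query"], dataset_rows[0]["context"]))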

Running an evaluation in Athina IDE

See this video below to learn how to run evaluations in Athina IDE (without writing any code).

Running Experiments in Athina IDE with Dynamic Columns

Dynamic columns in Athina IDE allow you to run more experiments on the dataset.

For example, you can:

  • Extract individual documents from the context (using the Code Execution dynamic column) and evaluate the accuracy of the 1st, 2nd, and 3rd chunks separately to see how good the ranking is (a sketch of this idea follows the list below).

  • Generate a summary of the retrieved documents and evaluate its accuracy (using the Run Prompt dynamic column).

  • Generate LLM responses (using the Run Prompt dynamic column).

  • Try rephrasing the query and evaluate the retrieval accuracy.

  • Compare this dataset with another dataset in Athina IDE to see the responses and eval scores side-by-side. This can be useful if you are trying multiple different retrieval strategies and want to compare their performance.
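
As a local sketch of the first idea above, the snippet below splits each row's context into separate per-rank columns using the dataset_rows built earlier. Inside Athina IDE, the same logic would live in a Code Execution dynamic column (how the row data is exposed there depends on your configuration).

# Split each row's context into separate columns so each rank can be scored on its own
expanded_rows = []
for row in dataset_rows:
    chunks = row["context"]
    expanded_rows.append({
        "query": row["query"],
        "chunk_1": chunks[0] if len(chunks) > 0 else "",
        "chunk_2": chunks[1] if len(chunks) > 1 else "",
        "chunk_3": chunks[2] if len(chunks) > 2 else "",
    })

# Preview the per-rank dataset
print(pd.DataFrame(expanded_rows))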

By leveraging these advanced features, you can continuously refine and improve the retrieval accuracy of your RAG applications.

You can book a call with us to learn more about how Athina can help your team build AI applications faster.