How to Measure Retrieval Accuracy in RAG Applications Using Athina IDE
While evaluating the accuracy of the Large Language Model (LLM) response is crucial, it is equally important to measure the accuracy of the retrieval step separately. This helps identify how effective the retrieval step is at providing relevant documents to the LLM for generating a response.
We’ll start by setting up a simple RAG application using Langchain and Chroma. For this example, we will use the first Harry Potter book, “The Sorcerer’s Stone”, and chunk it into segments of 512 characters with a 20-character overlap.
Import the Required Dependencies and Configure API Keys
```python
import os
import json
import pandas as pd
from dotenv import load_dotenv
from athina_client.keys import AthinaApiKey
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from athina_client.datasets import Dataset

# Load API keys
load_dotenv()

# You can get an Athina API key by signing up at https://app.athina.ai
AthinaApiKey.set_key(os.getenv('ATHINA_API_KEY'))

# Load the data
loader = TextLoader('data/harry_potter_sorcerers_stone.txt')
data = loader.load()

# Split the data into 512-character chunks with a 20-character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=20)
all_splits = text_splitter.split_documents(data)

# Drop any empty chunks before indexing
valid_documents = [doc for doc in all_splits if doc.page_content.strip()]

# Store splits in vector store
vectorstore = Chroma.from_documents(documents=valid_documents, embedding=OpenAIEmbeddings())
```
Let’s create a dataset with a set of queries and retrieve relevant documents from the vector store. Here, we use similarity search to retrieve the most similar documents for each query.
```python
dataset_rows = []
queries = [
    "What is the name of Harry Potter's aunt?",
    "What is the address of the Dursley house?",
    # Add more questions here
]

for query in queries:
    # Retrieve relevant documents from vector store
    relevant_documents = vectorstore.similarity_search(query, k=4)
    dataset_rows.append({
        "query": query,
        "context": [page.page_content for page in relevant_documents]
    })

# Print the dataframe to see the dataset
pd.DataFrame(dataset_rows)
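```

The imports above also pull in athina_client’s Dataset, which can push these rows into Athina IDE. A minimal sketch is below; the exact Dataset.create signature and the dataset name are assumptions here, so check the athina_client docs for the current API:

```python
# Hedged sketch: upload the rows as a dataset in Athina IDE.
# Dataset.create's exact signature is an assumption; verify against
# the athina_client documentation before use.
try:
    dataset = Dataset.create(
        name='harry_potter_retrieval_eval',  # hypothetical dataset name
        rows=dataset_rows,
    )
except Exception as e:
    print(f"Failed to create dataset: {e}")
```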
To measure the retrieval accuracy for this dataset, we used an LLM-as-a-judge evaluation, configured as a Custom Prompt. Here is the prompt that was used:
```
You are an expert at evaluation.
Determine if the context provided contains enough information to answer the given query.

### QUERY
{{query}}

### CONTEXT
{{retrieved_doc}}
```
This metric evaluates if the retrieved documents have enough information to answer the query.
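If you want to reproduce this evaluation outside Athina IDE, a minimal sketch might look like the following. It assumes an OPENAI_API_KEY in the environment and judges each row’s retrieved context as a whole; the model name and the pass/fail parsing are illustrative choices, not part of the original setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an expert at evaluation.
Determine if the context provided contains enough information to answer the given query.
Answer "Yes" or "No", followed by a brief explanation.

### QUERY
{query}

### CONTEXT
{retrieved_doc}"""

def judge_retrieval(query: str, retrieved_doc: str) -> bool:
    """Return True if the judge says the context can answer the query."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(query=query, retrieved_doc=retrieved_doc),
        }],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

# Evaluate each dataset row against its concatenated retrieved context
for row in dataset_rows:
    passed = judge_retrieval(row["query"], "\n\n".join(row["context"]))
    print(f"{row['query']} -> {'pass' if passed else 'fail'}")
```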
Running Experiments in Athina IDE with Dynamic Columns
Dynamic columns in Athina IDE allow you to run more experiments on the dataset. For example, you can:
Extract individual documents from the context (using the Code Execution dynamic column) and evaluate the accuracy of the 1st, 2nd, and 3rd chunks separately to see how good the ranking is (see the sketch after this list).
Generate a summary of the retrieved documents and evaluate its accuracy (using the Run Prompt dynamic column).
Generate LLM responses (using the Run Prompt dynamic column).
Try rephrasing the query and evaluate the retrieval accuracy.
Compare this dataset with another dataset in Athina IDE to see the responses and eval scores side-by-side. This can be useful if you are trying multiple different retrieval strategies and want to compare their performance.
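As a sketch of the first idea above, extracting per-rank chunks is just indexing into the context list. The column names are hypothetical, but this is the kind of snippet you could drop into a Code Execution dynamic column:

```python
import pandas as pd

df = pd.DataFrame(dataset_rows)

# Split the retrieved context into one column per rank so each chunk
# can be judged separately. Column names here are hypothetical.
for rank in range(3):
    df[f"chunk_{rank + 1}"] = df["context"].apply(
        lambda docs, r=rank: docs[r] if len(docs) > r else None
    )
```

Running the same judge prompt against each chunk_N column shows whether retrieval accuracy degrades with rank.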
By leveraging these advanced features, you can continuously refine and improve the retrieval accuracy of your RAG applications. You can book a call with us to learn more about how Athina can help your team build AI applications faster.