Ragas Context Precision evaluates whether the ground-truth relevant items present in the retrieved context are ranked highly. The metric is computed using the query and the context, with values ranging between 0 and 1, where higher scores indicate better precision.
Required Args
query: User Query
context: List of retrieved context
expected_response: Expected LLM Response
Default model: gpt-4-1106-preview
Sample Code:
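A minimal sketch of how this score can be computed by calling the underlying Ragas library directly (not necessarily the wrapper API documented here). The example data is illustrative, and Ragas column names such as ground_truth vary between versions:

```python
# Sketch: context precision via the Ragas library directly.
# Requires an OpenAI key in the environment (OPENAI_API_KEY).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision

data = Dataset.from_dict({
    "question": ["What is the capital of France?"],        # query
    "contexts": [["Paris is the capital of France."]],     # retrieved context
    "ground_truth": ["Paris is the capital of France."],   # expected_response
})

print(evaluate(data, metrics=[context_precision]))
```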
Ragas Context Relevancy gauges how relevant the retrieved context is, calculated from both the query and the context.
Required Args
query: User Query
context: List of retrieved context
Default model: gpt-4-1106-preview
Sample Code:
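An illustrative sketch against the Ragas library itself, assuming a version that still exposes context_relevancy (newer releases may rename or deprecate it):

```python
# Sketch: context relevancy needs only the query and the retrieved context.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_relevancy

data = Dataset.from_dict({
    "question": ["What is the capital of France?"],        # query
    "contexts": [["Paris is the capital of France.",
                  "France borders Spain and Germany."]],   # retrieved context
})

print(evaluate(data, metrics=[context_relevancy]))
```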
Ragas Context Recall measures the extent to which the retrieved context aligns with the ground truth answer. It is computed from the expected_response and the retrieved context.
To estimate context recall from the ground truth answer, each sentence in the ground truth answer is analyzed to determine whether it can be attributed to the retrieved context or not. In an ideal scenario, all sentences in the ground truth answer should be attributable to the retrieved context.
Required Args
query: User Query
context: List of retrieved context
expected_response: Expected LLM Response
Default model: gpt-4-1106-preview
Sample Code:
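A hedged sketch using Ragas directly; the ground_truth column stands in for expected_response, and the sample data is made up for illustration:

```python
# Sketch: context recall compares the expected response against the retrieved context.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_recall

data = Dataset.from_dict({
    "question": ["What is the capital of France?"],         # query
    "contexts": [["Paris is the capital of France."]],      # retrieved context
    "ground_truth": ["The capital of France is Paris."],    # expected_response
})

print(evaluate(data, metrics=[context_recall]))
```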
Ragas Faithfulness measures how factually consistent the LLM generated response is with the retrieved context.
Required Args
query: User Query
context: List of retrieved context your LLM response should be faithful to
response: The LLM generated response
Default model: gpt-4-1106-preview
Sample Code:
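An illustrative sketch of scoring faithfulness through Ragas directly, with the answer column standing in for response:

```python
# Sketch: faithfulness checks the generated response against the retrieved context.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

data = Dataset.from_dict({
    "question": ["What is the capital of France?"],     # query
    "contexts": [["Paris is the capital of France."]],  # context the response should be faithful to
    "answer": ["The capital of France is Paris."],      # response
})

print(evaluate(data, metrics=[faithfulness]))
```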
Ragas Answer Relevancy assesses how pertinent the generated response is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information. This metric is computed using the query and the LLM generated response.
An answer is deemed relevant when it directly and appropriately addresses the original question. Importantly, our assessment of answer relevance does not consider factuality but instead penalizes cases where the answer lacks completeness or contains redundant details. To calculate this score, the LLM is prompted to generate an appropriate question for the generated answer multiple times, and the mean cosine similarity between these generated questions and the original question is measured. The underlying idea is that if the generated answer accurately addresses the initial question, the LLM should be able to generate questions from the answer that align with the original question.
Required Args
query: User Query
context: List of retrieved context
response: The LLM generated response
Default model: gpt-4-1106-preview
Sample Code:
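A minimal sketch via Ragas directly; the sample rows are invented for illustration:

```python
# Sketch: answer relevancy scores how pertinent the response is to the query.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy

data = Dataset.from_dict({
    "question": ["What is the capital of France?"],     # query
    "contexts": [["Paris is the capital of France."]],  # retrieved context
    "answer": ["Paris is the capital of France."],      # response
})

print(evaluate(data, metrics=[answer_relevancy]))
```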
Ragas Answer Semantic Similarity assesses the semantic resemblance between the ground truth (expected_response) and the LLM generated response, with values falling within the range of 0 to 1. A higher score signifies better alignment between the generated answer and the ground truth.
Measuring the semantic similarity between answers can offer valuable insights into the quality of the generated response. This evaluation utilizes a cross-encoder model to calculate the semantic similarity score.
Required Args
response: The LLM generated response
expected_response: Expected LLM Response
Default model: gpt-4-1106-preview
Sample Code:
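A sketch using the Ragas answer_similarity metric directly (the wrapper documented above may expose a different name):

```python
# Sketch: semantic similarity between the response and the expected response.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_similarity

data = Dataset.from_dict({
    "answer": ["Paris is the capital of France."],        # response
    "ground_truth": ["The capital of France is Paris."],  # expected_response
})

print(evaluate(data, metrics=[answer_similarity]))
```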
Ragas Answer Correctness gauges the accuracy of the LLM generated response when compared against the ground truth. Scores range from 0 to 1; a higher score indicates closer alignment between the response and the ground truth expected_response, signifying better correctness.
Answer correctness encompasses two critical aspects: semantic similarity between the generated answer and the ground truth, as well as factual similarity. These aspects are combined using a weighted scheme to formulate the answer correctness score.
Required Args
query: User Query
response: The LLM generated response
expected_response: Expected LLM Response
Default model: gpt-4-1106-preview
Sample Code:
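An illustrative sketch of the Ragas answer_correctness metric called directly:

```python
# Sketch: answer correctness combines semantic and factual similarity to the ground truth.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness

data = Dataset.from_dict({
    "question": ["What is the capital of France?"],       # query
    "answer": ["Paris is the capital of France."],        # response
    "ground_truth": ["The capital of France is Paris."],  # expected_response
})

print(evaluate(data, metrics=[answer_correctness]))
```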
Ragas Coherence measures whether the LLM generated response presents ideas, information, or arguments in a logical and organized manner.
Required Args
query: User Query
context: List of retrieved context
response: The LLM generated response
expected_response: Expected LLM Response
Default model: gpt-4-1106-preview
Sample Code:
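A sketch assuming a Ragas version that exposes the aspect critiques under ragas.metrics.critique; the sample data is illustrative:

```python
# Sketch: coherence is an aspect critique over the generated response.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics.critique import coherence

data = Dataset.from_dict({
    "question": ["What is the capital of France?"],     # query
    "contexts": [["Paris is the capital of France."]],  # retrieved context
    "answer": ["Paris is the capital of France."],      # response
})

print(evaluate(data, metrics=[coherence]))
```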
Ragas Conciseness measures whether the LLM generated response conveys information or ideas clearly and efficiently, without unnecessary or redundant details.
Required Args
query: User Query
context: List of retrieved context
response: The LLM generated response
expected_response: Expected LLM Response
Default model: gpt-4-1106-preview
Sample Code:
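A minimal sketch of the conciseness critique, under the same assumptions as the coherence example above:

```python
# Sketch: conciseness critique of the generated response.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics.critique import conciseness

data = Dataset.from_dict({
    "question": ["What is the capital of France?"],     # query
    "contexts": [["Paris is the capital of France."]],  # retrieved context
    "answer": ["Paris is the capital of France."],      # response
})

print(evaluate(data, metrics=[conciseness]))
```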
Ragas Maliciousness measures whether the LLM generated response intends to harm, deceive, or exploit users.
Required Args
query: User Query
context: List of retrieved context
response: The LLM generated response
expected_response: Expected LLM Response
Default model: gpt-4-1106-preview
Sample Code:
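An illustrative sketch of the maliciousness critique via Ragas directly:

```python
# Sketch: maliciousness critique of the generated response.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics.critique import maliciousness

data = Dataset.from_dict({
    "question": ["What is the capital of France?"],     # query
    "contexts": [["Paris is the capital of France."]],  # retrieved context
    "answer": ["Paris is the capital of France."],      # response
})

print(evaluate(data, metrics=[maliciousness]))
```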
Ragas Harmfulness measures whether the LLM generated response has the potential to cause harm to individuals, groups, or society at large.
Required Args
query: User Query
context: List of retrieved context
response: The LLM generated response
expected_response: Expected LLM Response
Default model: gpt-4-1106-preview
Sample Code:
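A minimal sketch of the harmfulness critique via Ragas directly, with illustrative data:

```python
# Sketch: harmfulness critique of the generated response.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics.critique import harmfulness

data = Dataset.from_dict({
    "question": ["What is the capital of France?"],     # query
    "contexts": [["Paris is the capital of France."]],  # retrieved context
    "answer": ["Paris is the capital of France."],      # response
})

print(evaluate(data, metrics=[harmfulness]))
```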