Research Paper for IR Researchers, NLP Researchers, ML Engineers, and Developers of RAG Systems

Redefining Retrieval Evaluation in the Era of LLMs


📄 Abstract

Traditional Information Retrieval (IR) metrics, such as nDCG, MAP, and MRR, assume that human users sequentially examine documents with diminishing attention to lower ranks. This assumption breaks down in Retrieval Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs), which, unlike humans, process all retrieved documents as a whole rather than sequentially. Additionally, traditional IR metrics do not account for related but irrelevant documents that actively degrade generation quality, rather than merely being ignored. Due to these two major misalignments, namely human vs. machine position discount and human relevance vs. machine utility, classical IR metrics do not accurately predict RAG performance. We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact of distracting ones. Building on this foundation, we propose UDCG (Utility and Distraction-aware Cumulative Gain), a metric using an LLM-oriented positional discount to directly optimize the correlation with the end-to-end answer accuracy. Experiments on five datasets and six LLMs demonstrate that UDCG improves correlation by up to 36% compared to traditional metrics. Our work provides a critical step toward aligning IR evaluation with LLM consumers and enables more reliable assessment of RAG components.

Authors (5)

Giovanni Trappolini

Florin Cuconasu

Simone Filice

Yoelle Maarek

Fabrizio Silvestri

Submitted

October 24, 2025

arXiv Category

cs.CL


Key Contributions

This paper argues that traditional IR metrics (nDCG, MAP, MRR) are misaligned with Retrieval Augmented Generation (RAG) systems because LLMs process all retrieved documents as a whole, whereas humans examine them sequentially. It proposes a utility-based annotation schema and a new metric, UDCG (Utility and Distraction-aware Cumulative Gain), that account for both the positive contribution of relevant passages and the negative impact of distracting ones.
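To make the idea of signed utility concrete, here is a minimal Python sketch. It is not the paper's UDCG formula, which this summary does not reproduce. The function name `signed_cumulative_gain`, the utility values, and the default flat discount are all assumptions chosen for illustration: helpful passages get a positive score, distracting ones a negative score, and every position is weighted equally unless discounts are supplied.

```python
from typing import Optional, Sequence


def signed_cumulative_gain(
    utilities: Sequence[float],
    discounts: Optional[Sequence[float]] = None,
) -> float:
    """Sum per-passage utilities, each weighted by a positional discount.

    utilities[i] is a hypothetical score for the passage at rank i + 1:
    positive if it helps the LLM answer, negative if it distracts,
    zero if it is neutral. With discounts=None every position weighs 1.0,
    an assumed stand-in for "the LLM reads the whole context at once".
    """
    if discounts is None:
        discounts = [1.0] * len(utilities)
    if len(discounts) != len(utilities):
        raise ValueError("need exactly one discount per passage")
    return sum(u * d for u, d in zip(utilities, discounts))


# Hypothetical annotations: one helpful passage, then neutral or distracting ones.
clean = [1.0, 0.0, 0.0]
distracted = [1.0, -0.5, 0.0]

print(signed_cumulative_gain(clean))       # 1.0
print(signed_cumulative_gain(distracted))  # 0.5
```

Under these assumptions the distracting passage lowers the score of the second list, whereas a relevance-only metric would treat both lists identically.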

Business Value

Enables more accurate and meaningful evaluation of RAG systems, leading to better search and generation quality, and more reliable AI-powered information access tools.

Paper Metadata

Innovation Type

Evaluation Methodology

Deployment Feasibility

High, as it provides a framework for evaluating existing and future RAG systems.

Limitations Addressed

The breakdown of traditional IR metrics' assumptions (sequential examination, diminishing attention) when applied to RAG systems, and their failure to account for the negative impact of distracting but related documents.
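Both limitations can be seen in the standard nDCG definition, sketched below in Python using the common linear-gain form with a `log2(rank + 1)` discount. The relevance lists are made-up examples, not data from the paper.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain: gain at rank r is divided by log2(r + 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances: Sequence[float]) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Position discount: the same relevant passage is worth half as much at rank 3.
print(ndcg([1, 0, 0]))  # 1.0
print(ndcg([0, 0, 1]))  # 0.5

# Distraction blindness: a passage labelled 0 contributes nothing, whether it
# is harmlessly off-topic or actively misleading, so the score cannot drop.
print(ndcg([1, 0]))     # 1.0
```

The log discount encodes a human reader's fading attention, and the non-negative gain means a distracting passage can never be penalized, which are the two assumptions the paper questions for LLM consumers.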

Performance Gains

The paper focuses on improving evaluation metrics, not direct performance gains of a system.

Technical Tags

retrieval evaluation, LLMs, RAG systems, information retrieval, nDCG, MAP, MRR, utility-based metrics, document relevance, distraction, position discount

Research Topics

Information Retrieval, Natural Language Processing, Machine Learning, Large Language Models, Evaluation Metrics

Methods & Architectures

Critique of traditional IR metrics, Development of utility-based annotation schema, Proposal of new metrics (UDCG), Large Language Models (LLMs), Retrieval Augmented Generation (RAG) systems

Applications & Tasks

Information Retrieval Systems, Question Answering, Document Summarization, Knowledge Management, Misalignment of traditional IR metrics with RAG systems, LLM's non-sequential document processing, Impact of irrelevant documents, Human vs. machine relevance assessment, Redefining retrieval evaluation for RAG, Quantifying document utility and distraction, Developing new evaluation metrics

Related Fields

Search Engines, Recommender Systems, Knowledge Discovery, AI Evaluation

Keywords

retrieval evaluation, RAG, LLM, information retrieval, metrics, nDCG, utility, distraction, document relevance, position discount, evaluation, search

Academic Context

#Information Retrieval, #Natural Language Processing, #Machine Learning, #Large Language Models, #Evaluation Metrics

Commercial Potential

Potential Products

Improved evaluation suites for RAG systems, Tools for assessing search result quality

Target Industries

Technology, Search Engines, Knowledge Management, AI Development

Use Case Examples

Evaluating the effectiveness of a RAG-based question-answering system, Comparing different retrieval strategies for LLM augmentation, Developing better metrics for search result ranking

Competitive Edge

Offers a fundamentally new perspective on retrieval evaluation tailored for the LLM era, addressing critical shortcomings of traditional metrics.

Market Opportunity

Rapid growth of RAG systems creates a strong need for better evaluation.

Revenue Models

Licensing of evaluation tools, consulting services.

Resource Requirements

Compute Needs

Low, for applying the evaluation methodology.

Data Requirements

Requires annotated datasets for evaluating RAG systems based on the proposed schema.

Deployment Constraints

Requires careful annotation effort to create datasets for the new metrics.

Scalability

The evaluation methodology is scalable to large retrieval datasets.

Production Readiness

Maturity Level

Research

Time to Market

Immediate for research and development.
