arxiv_ir | 95% Match | Research Paper | Information Retrieval Researchers, Search Engine Developers, Machine Learning Engineers, AI Evaluation Specialists | 1 month ago

TRUE: A Reproducible Framework for LLM-Driven Relevance Judgment in Information Retrieval


Abstract

LLM-based relevance judgment generation has become a crucial approach to advancing evaluation methodologies in Information Retrieval (IR). It has progressed significantly, often showing high correlation with human judgments, as reflected in the LLMJudge leaderboards \cite{rahmani2025judging}. However, existing methods for relevance judgment rely heavily on sensitive prompting strategies and lack standardized workflows for generating reliable labels. To fill this gap, we reintroduce our method, Task-aware Rubric-based Evaluation (TRUE), for relevance judgment generation. TRUE was originally developed for usefulness evaluation in search sessions; we extend it to relevance judgment because of its demonstrated effectiveness and reproducible workflow. The framework leverages iterative data sampling and reasoning to evaluate relevance judgments across multiple factors, including intent, coverage, specificity, accuracy, and usefulness. In this paper, we evaluate TRUE on the TREC DL 2019, TREC DL 2020, and LLMJudge datasets, and our results show that TRUE achieves strong performance on the system-ranking LLM leaderboards. The primary focus of this work is to provide a reproducible framework for LLM-based relevance judgments, and we further analyze the effectiveness of TRUE across multiple dimensions.

Key Contributions

This paper reintroduces and extends the TRUE framework for reproducible LLM-driven relevance judgment generation in Information Retrieval. TRUE addresses the limitations of sensitive prompting strategies by employing iterative data sampling and reasoning across multiple factors (intent, coverage, specificity, accuracy, usefulness), aiming to produce reliable and standardized labels that correlate well with human judgments.
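To make the idea of judging along several factors concrete, here is a minimal sketch of how per-factor rubric scores might be collapsed into a single graded relevance label. Only the five factor names come from the summary above. The 0-3 scale, the `RubricScores` type, the `relevance_label` function, and the aggregation rule are all illustrative assumptions, not the rubric or workflow defined in the paper.

```python
from dataclasses import dataclass
from statistics import mean

# The five factor names come from the paper summary; the 0-3 scale and the
# aggregation rule below are illustrative assumptions only.
FACTORS = ("intent", "coverage", "specificity", "accuracy", "usefulness")


@dataclass(frozen=True)
class RubricScores:
    """Hypothetical per-factor scores, each on an assumed 0-3 scale."""
    intent: int
    coverage: int
    specificity: int
    accuracy: int
    usefulness: int

    def __post_init__(self) -> None:
        # Reject scores outside the assumed 0-3 range.
        for name in FACTORS:
            value = getattr(self, name)
            if not 0 <= value <= 3:
                raise ValueError(f"{name} must be in 0..3, got {value}")


def relevance_label(scores: RubricScores) -> int:
    """Collapse factor scores into one graded label (0-3).

    Assumed rule: a passage that misses the query intent is labeled 0;
    otherwise the label is the mean of the five factors, rounded half up.
    """
    if scores.intent == 0:
        return 0
    avg = mean(getattr(scores, name) for name in FACTORS)
    return int(avg + 0.5)  # round half up instead of banker's rounding


strong = relevance_label(RubricScores(3, 3, 2, 3, 3))
off_topic = relevance_label(RubricScores(0, 3, 3, 3, 3))
```

In a real pipeline the factor scores would come from an LLM prompted with the rubric; keeping the aggregation in plain code, as here, is one way to make that step deterministic and easy to audit.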

Business Value

Enables more reliable and standardized evaluation of search and IR systems, leading to better product development and performance optimization. Facilitates reproducible research in the field.

Paper Metadata

Innovation Type

Methodology and Framework

Deployment Feasibility

High. The framework is designed to be implemented using existing LLMs and standard IR evaluation pipelines.

Limitations Addressed

Lack of standardization and reproducibility in LLM-based relevance judgments. High sensitivity of LLM judgments to specific prompting techniques. Difficulty in aligning LLM judgments with human judgments. Need for a systematic workflow for generating evaluation labels.

Performance Gains

Strong performance on the system-ranking LLM leaderboards, evaluated on the TREC DL 2019, TREC DL 2020, and LLMJudge datasets, with high correlation with human judgments.
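System-level agreement between LLM-derived and human-derived evaluations is commonly summarized with a rank correlation over the participating systems. The sketch below computes Kendall's tau-a in plain Python. The summary does not state which statistic the paper reports, so the choice of Kendall's tau is an assumption, and the system names and scores are made up for illustration.

```python
from itertools import combinations


def kendall_tau(scores_a: dict[str, float], scores_b: dict[str, float]) -> float:
    """Kendall's tau-a between two score assignments over the same systems.

    Pairs tied under either assignment count as neither concordant
    nor discordant.
    """
    systems = sorted(scores_a)
    if set(systems) != set(scores_b):
        raise ValueError("both inputs must score the same systems")
    if len(systems) < 2:
        raise ValueError("need at least two systems")
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # Same sign of the two differences means the pair is ordered alike.
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up effectiveness scores for four hypothetical systems.
human = {"bm25": 0.48, "dense": 0.61, "hybrid": 0.66, "rerank": 0.70}
llm = {"bm25": 0.50, "dense": 0.65, "hybrid": 0.63, "rerank": 0.72}
tau = kendall_tau(human, llm)
```

A tau of 1.0 means both label sources rank the systems identically; each swapped pair of systems lowers the value.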

Technical Tags

LLM-based Relevance Judgment, Information Retrieval (IR), Task-aware Rubric-based Evaluation (TRUE), Reproducible Framework, Prompting Strategies, Iterative Data Sampling, LLMJudge, Relevance Judgment Generation, Human Judgments, Prompt Engineering

Research Topics

Information Retrieval Evaluation, Natural Language Processing, Machine Learning Evaluation, AI Alignment, Reproducibility in AI

Methods & Architectures

Task-aware Rubric-based Evaluation (TRUE); Iterative data sampling; Reasoning across multiple factors (intent, coverage, specificity, accuracy, usefulness); Prompting strategies

Applications & Tasks

Information Retrieval Systems, Search Engines, Recommender Systems, Academic Research

Lack of standardized workflows for LLM-based relevance judgments; Sensitivity of judgments to prompting strategies; Need for reliable and reproducible evaluation metrics; Bridging the gap between LLM judgments and human judgments

Generating reliable relevance judgments; Evaluating information retrieval systems; Ensuring reproducibility in IR evaluation; Improving LLM-based evaluation methodologies

Datasets & Benchmarks

Benchmarks

LLMJudge leaderboards (mentioned as context); evaluation datasets named in the abstract: TREC DL 2019, TREC DL 2020, LLMJudge

Correlation with human judgments, Intent, Coverage, Specificity, Accuracy, Usefulness

Related Fields

Information Retrieval, Natural Language Processing, Machine Learning, Evaluation Methodologies, AI Ethics

Keywords

Information Retrieval, IR Evaluation, LLM, Relevance Judgment, TRUE framework, Reproducibility, Prompting, LLMJudge, Human Judgment, Evaluation Metrics, NLP, Machine Learning

Academic Context

#Information Retrieval Evaluation, #Natural Language Processing, #Machine Learning Evaluation, #AI Alignment, #Reproducibility in AI

Commercial Potential

Potential Products

An automated evaluation service for IR systems. A standardized toolkit for LLM-based relevance judgment.

Target Industries

Technology (Search), E-commerce, Information Services, Academia

Use Case Examples

Evaluating the performance of a new search algorithm using LLM-generated relevance labels. Benchmarking different LLMs for their ability to generate accurate relevance judgments.
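As an illustration of the first use case, the sketch below scores one ranked list against graded labels with nDCG@k, a standard IR effectiveness measure. The document IDs, the labels, and the use of the exponential-gain form of DCG are assumptions made for the example; in practice the labels would come from an LLM judge and the measure would be whatever the evaluation campaign specifies.

```python
import math


def dcg(gains: list[int], k: int) -> float:
    """Discounted cumulative gain over the top-k gains (exponential gain form)."""
    return sum(
        (2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(gains[:k])
    )


def ndcg_at_k(ranking: list[str], labels: dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query; unlabeled documents count as non-relevant (0)."""
    gains = [labels.get(doc_id, 0) for doc_id in ranking]
    # The ideal ranking orders all labeled documents by decreasing grade.
    ideal = sorted(labels.values(), reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical LLM-generated labels (0-3) and one system's ranking.
llm_labels = {"d1": 3, "d2": 0, "d3": 2, "d4": 1}
run = ["d3", "d1", "d2", "d4"]
score = ndcg_at_k(run, llm_labels, k=3)
```

Averaging this per-query score over all queries gives the system-level number that would then be compared across systems.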

Competitive Edge

Provides a more robust, reproducible, and standardized alternative to ad-hoc prompting methods for LLM-based relevance judgments.

Market Opportunity

Significant market for IR evaluation tools and services.

Revenue Models

Licensing of the framework, consulting services, or integration into commercial evaluation platforms.

Resource Requirements

Compute Needs

Moderate, depending on the LLMs used for judgment generation and the size of the dataset.

Data Requirements

Requires query-document pairs and potentially human judgments for calibration.
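One way human judgments could be used for calibration is to measure label-level agreement on a shared sample of query-document pairs. The sketch below computes unweighted Cohen's kappa in plain Python. The label values are invented, and the summary does not say which agreement statistic, if any, the paper uses for this purpose.

```python
from collections import Counter


def cohen_kappa(human: list[int], llm: list[int]) -> float:
    """Unweighted Cohen's kappa between two label sequences of equal length."""
    if len(human) != len(llm) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: fraction of pairs given the same label.
    observed = sum(h == m for h, m in zip(human, llm)) / n
    # Chance agreement from each annotator's marginal label distribution.
    human_counts, llm_counts = Counter(human), Counter(llm)
    expected = sum(human_counts[c] * llm_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical graded labels (0-3) for eight query-document pairs.
human_labels = [3, 2, 0, 1, 0, 2, 3, 0]
llm_labels = [3, 2, 0, 0, 0, 1, 3, 0]
kappa = cohen_kappa(human_labels, llm_labels)
```

Because graded relevance labels are ordinal, a weighted variant of kappa may be a better fit when near-misses should be penalized less than large disagreements.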

Deployment Constraints

Reliance on the quality and capabilities of the underlying LLMs. Potential for biases inherited from the LLMs or training data.

Scalability

Scalable to large datasets, provided sufficient computational resources for LLM inference.

Regulatory Considerations

N/A

Production Readiness

Maturity Level

Research/Framework Development

Time to Market

1-2 years for integration into existing IR evaluation platforms.

Patent Potential

Low (Methodology focused)
