Abstract
LLM-based relevance judgment generation has become a crucial approach to
advancing evaluation methodologies in Information Retrieval (IR). It has
progressed significantly, often showing high correlation with human judgments,
as reflected in the LLMJudge leaderboards \cite{rahmani2025judging}. However,
existing methods for relevance judgment rely heavily on sensitive prompting
strategies and lack standardized workflows for generating reliable labels. To
fill this gap, we reintroduce our method, \textit{Task-aware Rubric-based
Evaluation} (TRUE), for relevance judgment generation. Originally developed for
usefulness evaluation in search sessions, TRUE is extended here to relevance
judgment because of its demonstrated effectiveness and reproducible workflow.
The framework leverages iterative data sampling and reasoning to evaluate
relevance across multiple factors, including intent, coverage, specificity,
accuracy, and usefulness. In this paper, we evaluate TRUE on the TREC DL 2019,
TREC DL 2020, and LLMJudge datasets, and our results show that it achieves
strong performance on the system-ranking LLM leaderboards. The primary focus of
this work is to provide a reproducible framework for LLM-based relevance
judgments, and we further analyze the effectiveness of TRUE across multiple
dimensions.
Key Contributions
This paper reintroduces and extends the TRUE framework for reproducible LLM-driven relevance judgment generation in Information Retrieval. TRUE addresses the limitations of sensitive prompting strategies by employing iterative data sampling and reasoning across multiple factors (intent, coverage, specificity, accuracy, usefulness), aiming to produce reliable and standardized labels that correlate well with human judgments.
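The abstract and contribution summary describe the workflow only at a high level, so the sketch below illustrates what a rubric-based, multi-factor judgment loop of this kind might look like. Only the five factor names are taken from the paper; the prompt wording, the 0-3 scale, the aggregation rule, and the `llm` callable are illustrative assumptions, not the authors' TRUE implementation (which also involves iterative data sampling not shown here).

```python
# Minimal sketch of a rubric-based relevance judgment loop (assumptions noted above).
from dataclasses import dataclass
from typing import Callable, Dict

# Factor names come from the paper's abstract; everything else is hypothetical.
FACTORS = ["intent", "coverage", "specificity", "accuracy", "usefulness"]


@dataclass
class Judgment:
    query: str
    passage: str
    factor_scores: Dict[str, int]  # each factor scored 0-3 (assumed scale)
    label: int                     # aggregated graded relevance label


def build_rubric_prompt(query: str, passage: str, factor: str) -> str:
    """Compose a per-factor rubric prompt (hypothetical wording)."""
    return (
        f"Rate the passage for '{factor}' with respect to the query on a 0-3 scale.\n"
        f"Query: {query}\nPassage: {passage}\n"
        "Answer with a single integer."
    )


def judge(query: str, passage: str, llm: Callable[[str], str]) -> Judgment:
    """Score one query-passage pair on every rubric factor, then aggregate.

    `llm` is a placeholder for whatever completion API is used: it takes a
    prompt string and returns the model's text response.
    """
    scores: Dict[str, int] = {}
    for factor in FACTORS:
        raw = llm(build_rubric_prompt(query, passage, factor)).strip()
        try:
            score = int(raw.split()[0])
        except (ValueError, IndexError):
            score = 0  # fall back when the model output is not parseable
        scores[factor] = max(0, min(3, score))  # clamp to the assumed 0-3 scale
    # Assumption: the final label is the rounded mean of the factor scores.
    label = round(sum(scores.values()) / len(FACTORS))
    return Judgment(query, passage, scores, label)
```

In this sketch each factor is scored by a separate prompt and the scores are averaged into a single graded label; the actual framework may combine reasoning across factors differently.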
Business Value
Enables more reliable and standardized evaluation of search and IR systems, leading to better product development and performance optimization. Facilitates reproducible research in the field.