
Quantitative LLM Judges

Abstract

LLM-as-a-judge is a framework where a large language model (LLM) evaluates the output of another LLM. While LLMs excel at producing qualitative textual evaluations, they often struggle to predict human preferences and numeric scores. We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to humans in a given domain using regression models. The models are trained to improve the score of the original judge using its rationale and score. We present four quantitative judges for different types of absolute and relative feedback, which showcases the generality and versatility of our framework. Our framework is more computationally efficient than supervised fine-tuning and can be more statistically efficient when human feedback is limited, which is expected in practice. We validate these claims empirically on four datasets using two base judges. Our experiments show that quantitative judges can improve the predictive power of existing judges through post-hoc modeling.
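To make the recipe concrete, here is a minimal sketch of the general idea, assuming the base judge's written rationale is mapped to a feature vector by some text encoder and combined with its numeric score before a post-hoc regressor aligns it to human scores. The encoder, dimensions, and data below are placeholders for illustration, not the paper's actual architecture.

```python
# Minimal sketch (not the paper's exact model): align a base judge's numeric
# score to human scores with a post-hoc regression on its rationale and score.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_examples, embed_dim = 200, 32

# Placeholder for an embedding of the judge's rationale (any text encoder
# could produce this); random here so the sketch runs standalone.
rationale_embedding = rng.normal(size=(n_examples, embed_dim))
judge_score = rng.integers(1, 6, size=n_examples).astype(float)  # base judge's 1-5 score
human_score = np.clip(judge_score + rng.normal(scale=0.7, size=n_examples), 1, 5)  # synthetic targets

# Features: rationale representation concatenated with the original score.
X = np.hstack([rationale_embedding, judge_score[:, None]])
model = Ridge(alpha=1.0).fit(X, human_score)

corrected_score = model.predict(X)  # aligned ("quantitative") judge score
print("MSE vs. human scores:", float(np.mean((corrected_score - human_score) ** 2)))
```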
Authors (12)
Aishwarya Sahoo
Jeevana Kruthi Karnuthala
Tushar Parmanand Budhwani
Pranchal Agarwal
Sankaran Vaidyanathan
Alexa Siu
and 6 additional authors
Submitted
June 3, 2025
arXiv Category
cs.CL
arXiv PDF

Key Contributions

This paper proposes 'quantitative LLM judges', which align the evaluation scores of existing LLM judges to human preferences in a given domain using regression models. Trained on the base judge's rationale and score, these models improve the predictive power of the original judge while being more computationally efficient than supervised fine-tuning, and they can be more statistically efficient when human feedback is limited. The framework's versatility is demonstrated by four judges covering different types of absolute and relative feedback.
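For the relative-feedback setting mentioned above, one natural instantiation is a pairwise judge that predicts which of two responses humans prefer from the difference of the base judge's features. The sketch below is a hypothetical Bradley-Terry-style illustration with synthetic data, not the paper's exact formulation.

```python
# Hypothetical sketch for relative (pairwise) feedback: predict the human
# preference between two responses from the difference of judge features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_pairs, dim = 300, 33  # e.g., rationale embedding plus base score per response

feats_a = rng.normal(size=(n_pairs, dim))  # judge features for response A
feats_b = rng.normal(size=(n_pairs, dim))  # judge features for response B
# Synthetic preference labels so the example runs standalone.
human_prefers_a = (feats_a.sum(axis=1) > feats_b.sum(axis=1)).astype(int)

# Bradley-Terry-style formulation: classify on the feature difference.
X = feats_a - feats_b
clf = LogisticRegression(max_iter=1000).fit(X, human_prefers_a)
print("training accuracy:", clf.score(X, human_prefers_a))
```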

Business Value

Enables more reliable and efficient evaluation of AI-generated content and responses, leading to better quality control and faster iteration cycles in developing LLM-based products. This can reduce costs associated with human evaluation.