This paper proposes 'quantitative LLM judges' that align the evaluation scores of existing LLM judges to human preferences using regression models. By training on the LLM judge's rationale and score, these quantitative judges improve predictive power and computational efficiency compared to supervised fine-tuning, especially when human feedback is limited. The framework is versatile, demonstrated by four judges for different types of feedback.
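A minimal sketch of the idea, under illustrative assumptions: a lightweight regression model maps an LLM judge's rationale (here embedded with TF-IDF as a stand-in for the paper's text features) together with its raw score to a human-aligned score. The feature choices, model, and toy data below are hypothetical, not the paper's exact recipe.

```python
# Quantitative-judge sketch: regress human scores on (rationale, LLM score).
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Toy training data: the base judge's rationale and score, plus human labels.
rationales = [
    "The answer is accurate and well structured.",
    "The response misses key details and is vague.",
    "Mostly correct but contains a factual slip.",
]
llm_scores = np.array([[9.0], [4.0], [7.0]])   # raw scores from the LLM judge
human_scores = np.array([8.5, 3.0, 6.0])       # limited human feedback

# Embed rationales (TF-IDF stands in for a frozen LLM embedding here).
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(rationales)

# Concatenate rationale features with the judge's raw score.
X = hstack([X_text, csr_matrix(llm_scores)])

# Cheap regression head: trainable on small amounts of human feedback,
# far lighter than supervised fine-tuning of the judge itself.
judge = Ridge(alpha=1.0).fit(X, human_scores)

# Predict a human-aligned score for a newly judged response.
new_X = hstack([
    vectorizer.transform(["Clear, correct, and concise answer."]),
    csr_matrix(np.array([[8.0]])),
])
print(judge.predict(new_X))
```

The design point this illustrates is that the base LLM judge stays frozen; only the small regression head is fit to human preferences, which keeps calibration inexpensive when labeled feedback is scarce.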
The approach enables more reliable and efficient evaluation of AI-generated content and responses, supporting better quality control and faster iteration cycles when developing LLM-based products, while reducing the costs associated with human evaluation.