Do LLM Evaluators Prefer Themselves for a Reason?

📄 Abstract

Large language models (LLMs) are increasingly used as automatic evaluators in applications like benchmarking, reward modeling, and self-refinement. Prior work highlights a potential self-preference bias where LLMs favor their own generated responses, a tendency often intensifying with model size and capability. This raises a critical question: Is self-preference harmful, or does it simply reflect the genuinely higher-quality outputs of stronger models? Answering this has been difficult as previous studies relied primarily on subjective tasks. These tasks lack an objective ground truth, meaning that either preference can be reasonably justified. To address this ambiguity, we investigate self-preference using verifiable benchmarks (mathematical reasoning, factual knowledge, code generation) that allow objective ground-truth assessment. This enables us to distinguish harmful self-preference (favoring objectively worse responses) from legitimate self-preference (favoring genuinely superior ones). We conduct large-scale experiments under controlled evaluation conditions across diverse model families (e.g., Llama, Qwen, Gemma, Mistral, Phi, GPT, DeepSeek). Our findings reveal three key insights: (1) While stronger models exhibit greater self-preference, much of this preference aligns with objectively superior performance, indicating stronger models prefer themselves mostly legitimately. (2) Harmful self-preference persists when evaluator models err as generators, and stronger models display more pronounced harmful self-preference when they do err. This suggests stronger models struggle more to recognize when they are wrong. (3) Inference-time scaling strategies, such as generating a long Chain-of-Thought before evaluation, effectively reduce harmful self-preference. These results provide a more nuanced understanding of LLM-based evaluation and practical insights for improving its reliability.
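To make the paper's central distinction concrete, the sketch below (a hypothetical illustration, not the authors' code) shows how a pairwise judgment could be labeled as legitimate or harmful self-preference once ground-truth correctness is available from a verifiable benchmark; the `PairwiseJudgment` fields and `classify_self_preference` helper are assumed names for this example.

```python
from dataclasses import dataclass

@dataclass
class PairwiseJudgment:
    """One pairwise evaluation: the evaluator judges its own response against another model's."""
    own_correct: bool    # ground-truth verdict for the evaluator's own response
    other_correct: bool  # ground-truth verdict for the competing response
    prefers_self: bool   # whether the evaluator picked its own response

def classify_self_preference(j: PairwiseJudgment) -> str:
    """Split self-preference into legitimate vs. harmful using objective ground truth."""
    if not j.prefers_self:
        return "no_self_preference"
    if j.own_correct and not j.other_correct:
        return "legitimate"        # preferring itself matches the ground truth
    if not j.own_correct and j.other_correct:
        return "harmful"           # preferring an objectively worse (wrong) response
    return "indistinguishable"     # both correct or both wrong: ground truth cannot adjudicate

# Toy usage: estimate the harmful self-preference rate over a batch of judgments.
judgments = [
    PairwiseJudgment(own_correct=True, other_correct=False, prefers_self=True),
    PairwiseJudgment(own_correct=False, other_correct=True, prefers_self=True),
    PairwiseJudgment(own_correct=False, other_correct=True, prefers_self=False),
]
labels = [classify_self_preference(j) for j in judgments]
print(labels, "harmful rate:", labels.count("harmful") / len(labels))
```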
Authors (5): Wei-Lin Chen, Zhepei Wei, Xinyu Zhu, Shi Feng, Yu Meng
Submitted: April 4, 2025
arXiv Category: cs.CL

Key Contributions

Investigates LLM self-preference bias on verifiable benchmarks (mathematical reasoning, factual knowledge, code generation), where objective ground truth makes it possible to distinguish harmful self-preference from legitimate preference for genuinely better responses. Unlike prior work that relied on subjective tasks, this setup clarifies whether LLM evaluators favor themselves for valid reasons or out of bias.
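The abstract's third finding, that asking the evaluator to generate a long chain of thought before committing to a verdict reduces harmful self-preference, could be exercised with a prompt-construction sketch like the one below; the function name, prompt wording, and verdict format are illustrative assumptions, not the paper's actual evaluation setup.

```python
def build_pairwise_judge_prompt(question: str, answer_a: str, answer_b: str,
                                use_cot: bool = True) -> str:
    """Build a pairwise-evaluation prompt; use_cot=True asks the judge to reason
    step by step (inference-time scaling) before it outputs a final verdict."""
    instructions = "You are given a question and two candidate answers. Decide which answer is better.\n"
    if use_cot:
        instructions += ("Think through the problem step by step, solving it yourself and "
                         "checking each candidate answer against your reasoning, before deciding.\n")
    instructions += 'Finish with exactly one line: "Verdict: A" or "Verdict: B".\n'
    return (f"{instructions}\n"
            f"Question:\n{question}\n\n"
            f"Answer A:\n{answer_a}\n\n"
            f"Answer B:\n{answer_b}")

# Example: compare the direct-verdict prompt with the chain-of-thought variant.
print(build_pairwise_judge_prompt("What is 17 * 24?", "408", "398", use_cot=False))
print(build_pairwise_judge_prompt("What is 17 * 24?", "408", "398", use_cot=True))
```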

Business Value

Improves the reliability and fairness of LLM-based evaluation systems, which is crucial for developing trustworthy AI, improving model training (e.g., RLHF reward modeling), and preventing self-reinforcing biases in AI development.