Abstract
Large language models (LLMs) are increasingly used as automatic evaluators in
applications like benchmarking, reward modeling, and self-refinement. Prior
work highlights a potential self-preference bias where LLMs favor their own
generated responses, a tendency that often intensifies with model size and
capability. This raises a critical question: Is self-preference harmful, or
does it simply reflect the genuinely higher-quality outputs of stronger models?
Answering this has been difficult as previous studies relied primarily on
subjective tasks. These tasks lack an objective ground truth, meaning that
either preference can be reasonably justified. To address this ambiguity, we
investigate self-preference using verifiable benchmarks (mathematical
reasoning, factual knowledge, code generation) that allow objective
ground-truth assessment. This enables us to distinguish harmful self-preference
(favoring objectively worse responses) from legitimate self-preference
(favoring genuinely superior ones). We conduct large-scale experiments under
controlled evaluation conditions across diverse model families (e.g., Llama,
Qwen, Gemma, Mistral, Phi, GPT, DeepSeek). Our findings reveal three key
insights: (1) While stronger models exhibit greater self-preference, much of
this preference aligns with objectively superior performance, indicating
stronger models prefer themselves mostly legitimately. (2) Harmful
self-preference persists when evaluator models err as generators, and stronger
models display more pronounced harmful self-preference when they do err. This
suggests stronger models struggle more to recognize when they are wrong. (3)
Inference-time scaling strategies, such as generating a long Chain-of-Thought
before evaluation, effectively reduce harmful self-preference. These results
provide a more nuanced understanding of LLM-based evaluation and practical
insights for improving its reliability.
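To make the harmful-versus-legitimate distinction concrete, here is a minimal sketch (not the authors' code; the `Judgment` dataclass and `classify_self_preference` helper are invented for illustration) of how a pairwise judgment could be labeled once both responses have verifiable ground-truth labels, as the abstract describes for math, factual, and code benchmarks.

```python
# Hypothetical sketch: separate harmful from legitimate self-preference
# once each response can be checked against a verifiable ground truth.
from dataclasses import dataclass

@dataclass
class Judgment:
    own_correct: bool    # evaluator's own response matches ground truth
    other_correct: bool  # the other model's response matches ground truth
    prefers_own: bool    # evaluator picked its own response

def classify_self_preference(j: Judgment) -> str:
    """Label a single pairwise judgment."""
    if not j.prefers_own:
        return "no_self_preference"
    if j.own_correct and not j.other_correct:
        return "legitimate"      # favoring a genuinely superior response
    if not j.own_correct and j.other_correct:
        return "harmful"         # favoring an objectively worse response
    return "indeterminate"       # both correct or both wrong

# Toy usage: fraction of self-preferring judgments that are harmful.
judgments = [
    Judgment(own_correct=True,  other_correct=False, prefers_own=True),
    Judgment(own_correct=False, other_correct=True,  prefers_own=True),
    Judgment(own_correct=True,  other_correct=True,  prefers_own=False),
]
labels = [classify_self_preference(j) for j in judgments]
self_pref = [lab for lab in labels if lab != "no_self_preference"]
harmful_rate = self_pref.count("harmful") / max(len(self_pref), 1)
print(labels, harmful_rate)
```

This is only a schematic reading of the setup: the actual experiments additionally control evaluation conditions across model families and apply inference-time strategies (e.g., long Chain-of-Thought before judging) that the sketch does not cover.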
Authors (5)
Wei-Lin Chen
Zhepei Wei
Xinyu Zhu
Shi Feng
Yu Meng
Key Contributions
Investigates LLM self-preference bias using verifiable benchmarks (math, facts, code) to distinguish harmful bias from genuine quality preference. This approach allows for objective ground-truth assessment, unlike the subjective tasks used in prior work, providing a clearer understanding of whether LLM evaluators favor themselves for valid reasons or due to bias.
Business Value
Ensures the reliability and fairness of LLM-based evaluation systems, crucial for developing trustworthy AI, improving model training (e.g., RLHF), and preventing self-reinforcing biases in AI development.