Abstract
Large language models (LLMs) are increasingly used as automatic evaluators in
applications like benchmarking, reward modeling, and self-refinement. Prior
work highlights a potential self-preference bias where LLMs favor their own
generated responses, a tendency that often intensifies with model size and
capability. This raises a critical question: Is self-preference harmful, or
does it simply reflect the genuinely higher-quality outputs of stronger models?
Answering this has been difficult as previous studies relied primarily on
subjective tasks. These tasks lack an objective ground truth, meaning that
either preference can be reasonably justified. To address this ambiguity, we
investigate self-preference using verifiable benchmarks (mathematical
reasoning, factual knowledge, code generation) that allow objective
ground-truth assessment. This enables us to distinguish harmful self-preference
(favoring objectively worse responses) from legitimate self-preference
(favoring genuinely superior ones). We conduct large-scale experiments under
controlled evaluation conditions across diverse model families (e.g., Llama,
Qwen, Gemma, Mistral, Phi, GPT, DeepSeek). Our findings reveal three key
insights: (1) While stronger models exhibit greater self-preference, much of
this preference aligns with objectively superior performance, indicating
stronger models prefer themselves mostly legitimately. (2) Harmful
self-preference persists when evaluator models err as generators, and stronger
models display more pronounced harmful self-preference when they do err. This
suggests stronger models struggle more to recognize when they are wrong. (3)
Inference-time scaling strategies, such as generating a long Chain-of-Thought
before evaluation, effectively reduce harmful self-preference. These results
provide a more nuanced understanding of LLM-based evaluation and practical
insights for improving its reliability.
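To make the harmful-versus-legitimate distinction concrete, here is a minimal sketch (not the authors' code; the `Judgment` dataclass and `classify_self_preference` helper are invented for illustration) of how a pairwise judgment could be labeled once both responses have verifiable ground-truth labels, as the abstract describes for math, factual, and code benchmarks.

```python
# Hypothetical sketch: separate harmful from legitimate self-preference
# once each response can be checked against a verifiable ground truth.
from dataclasses import dataclass

@dataclass
class Judgment:
    own_correct: bool    # evaluator's own response matches ground truth
    other_correct: bool  # the other model's response matches ground truth
    prefers_own: bool    # evaluator picked its own response

def classify_self_preference(j: Judgment) -> str:
    """Label a single pairwise judgment."""
    if not j.prefers_own:
        return "no_self_preference"
    if j.own_correct and not j.other_correct:
        return "legitimate"      # favoring a genuinely superior response
    if not j.own_correct and j.other_correct:
        return "harmful"         # favoring an objectively worse response
    return "indeterminate"       # both correct or both wrong

# Toy usage: fraction of self-preferring judgments that are harmful.
judgments = [
    Judgment(own_correct=True,  other_correct=False, prefers_own=True),
    Judgment(own_correct=False, other_correct=True,  prefers_own=True),
    Judgment(own_correct=True,  other_correct=True,  prefers_own=False),
]
labels = [classify_self_preference(j) for j in judgments]
self_pref = [lab for lab in labels if lab != "no_self_preference"]
harmful_rate = self_pref.count("harmful") / max(len(self_pref), 1)
print(labels, harmful_rate)
```

This is only a schematic reading of the setup: the actual experiments additionally control evaluation conditions across model families and apply inference-time strategies (e.g., long Chain-of-Thought before judging) that the sketch does not cover.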
Authors (5)
Wei-Lin Chen
Zhepei Wei
Xinyu Zhu
Shi Feng
Yu Meng
Key Contributions
Investigates LLM self-preference bias using verifiable benchmarks (math, facts, code) to distinguish harmful bias from genuine quality preference. This approach allows for objective ground-truth assessment, unlike the subjective tasks used in prior work, providing a clearer understanding of whether LLM evaluators favor themselves for valid reasons or due to bias.
Business Value
Ensures the reliability and fairness of LLM-based evaluation systems, crucial for developing trustworthy AI, improving model training (e.g., RLHF), and preventing self-reinforcing biases in AI development.