
Strategyproof Reinforcement Learning from Human Feedback

📄 Abstract

We study Reinforcement Learning from Human Feedback (RLHF) in settings where multiple labelers may strategically misreport feedback to steer the learned policy toward their own preferences. We show that existing RLHF algorithms, including recent pluralistic methods, are not strategyproof, and that even a single strategic labeler can cause arbitrarily large misalignment with social welfare. Moreover, we prove that, in the worst case, any strategyproof RLHF algorithm must perform $k$-times worse than the optimal policy, where $k$ is the number of labelers. This suggests a fundamental trade-off between incentive alignment (ensuring labelers report truthfully) and policy alignment (maximizing social welfare). To address this, we propose the Pessimistic Median of MLEs algorithm, which, under appropriate policy coverage assumptions, is approximately strategyproof and converges to the optimal policy as the number of labelers and samples increases. Our results apply to both contextual bandits and Markov decision processes.
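To make the aggregation idea concrete, the sketch below illustrates the spirit of a "median of MLEs with pessimism" in a finite contextual-bandit setting: fit one reward estimate per labeler, combine them with a coordinate-wise median (so a single strategic labeler has bounded influence), subtract a coverage-based pessimism penalty, and pick the best candidate policy. This is a minimal illustration under assumed inputs, not the paper's exact construction; the function name, the penalty form, and the uniform evaluation distribution are all illustrative choices.

```python
import numpy as np

def pessimistic_median_policy(per_labeler_rewards, counts, candidate_policies, beta=1.0):
    """Toy aggregation in the spirit of a median-of-MLEs rule with pessimism.

    per_labeler_rewards: (k, S, A) array -- each labeler's estimated reward for
        every (state, action) pair (assumed to be fit beforehand, e.g. by MLE
        on that labeler's preference data).
    counts: (S, A) array -- how often each (state, action) pair is covered by
        the preference data; used for a simple pessimism penalty.
    candidate_policies: list of deterministic policies, each an (S,) array
        mapping states to actions.
    beta: penalty weight (assumed hyperparameter).
    """
    # Coordinate-wise median across labelers: unlike the mean, a single
    # strategic labeler cannot move the median arbitrarily far.
    median_reward = np.median(per_labeler_rewards, axis=0)        # (S, A)

    # Pessimism: down-weight poorly covered (state, action) pairs.
    penalty = beta / np.sqrt(np.maximum(counts, 1.0))             # (S, A)
    pessimistic_reward = median_reward - penalty

    # Return the candidate policy with the highest pessimistic value under a
    # uniform state distribution (illustrative evaluation choice).
    def value(policy):
        states = np.arange(pessimistic_reward.shape[0])
        return pessimistic_reward[states, policy].mean()

    return max(candidate_policies, key=value)
```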
Authors (4): Thomas Kleine Buening, Jiarui Gan, Debmalya Mandal, Marta Kwiatkowska
Submitted: March 12, 2025
arXiv Category: cs.LG

Key Contributions

This work shows that existing RLHF algorithms, including recent pluralistic methods, are not strategyproof: labelers can strategically misreport feedback to steer the learned policy, and even a single strategic labeler can cause arbitrarily large misalignment with social welfare. It proves a fundamental worst-case trade-off between strategyproofness and policy alignment (any strategyproof algorithm can be forced to perform k-times worse than the optimal policy, where k is the number of labelers) and introduces the 'Pessimistic Median of MLEs' algorithm, which is approximately strategyproof and, under appropriate policy coverage assumptions, converges to the optimal policy as the number of labelers and samples grows.
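As a quick illustration of why naive aggregation is fragile (hypothetical numbers, not from the paper): averaging labeler reports lets one manipulator dominate the aggregate, while a median-style aggregate keeps it near the honest reports, which is the intuition behind the median-based rule above.

```python
import numpy as np

# Toy illustration of the manipulation problem. With mean aggregation, a single
# strategic labeler can drag the aggregate reward for an action arbitrarily far;
# the median caps their influence. All numbers are made up.
honest_reports = np.array([0.2, 0.3, 0.25, 0.35])  # four truthful labelers
strategic_report = 100.0                            # one labeler exaggerates

all_reports = np.append(honest_reports, strategic_report)

print("mean  :", all_reports.mean())      # ~20.2 -- dominated by the manipulator
print("median:", np.median(all_reports))  # 0.3   -- stays near the honest reports
```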

Business Value

Crucial for developing trustworthy AI systems, especially LLMs: it helps ensure that aggregated human feedback genuinely reflects desired behavior rather than being skewed by strategic labelers, leading to safer and more reliable models.