📄 Abstract
We study Reinforcement Learning from Human Feedback (RLHF) in settings where
multiple labelers may strategically misreport feedback to steer the learned
policy toward their own preferences. We show that existing RLHF algorithms,
including recent pluralistic methods, are not strategyproof, and that even a
single strategic labeler can cause arbitrarily large misalignment with social
welfare. Moreover, we prove that, in the worst case, any strategyproof RLHF
algorithm must perform $k$-times worse than the optimal policy, where $k$ is
the number of labelers. This suggests a fundamental trade-off between incentive
alignment (ensuring labelers report truthfully) and policy alignment
(maximizing social welfare). To address this, we propose the Pessimistic Median
of MLEs algorithm, which, under appropriate policy coverage assumptions, is
approximately strategyproof and converges to the optimal policy as the number
of labelers and samples increases. Our results apply to both contextual bandits
and Markov decision processes.
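The abstract names the Pessimistic Median of MLEs algorithm only at a high level. Below is a minimal, self-contained sketch of the underlying idea as described here: fit a separate reward-model MLE per labeler, aggregate estimates with a coordinate-wise median so that a minority of strategic labelers cannot move the result far, and subtract a pessimism penalty for poorly covered actions before acting. The function names (`fit_bt_mle`, `pessimistic_median_policy`), the tabular Bradley-Terry setup, and the parameters `beta` and `counts` are illustrative assumptions, not the paper's construction or notation.

```python
import numpy as np

def fit_bt_mle(prefs, n_actions, lr=0.5, iters=500):
    """Fit a toy Bradley-Terry reward vector from one labeler's pairwise preferences.

    prefs: list of (winner_action, loser_action) pairs.
    Returns an estimated reward per action (a stand-in for a per-labeler MLE).
    """
    r = np.zeros(n_actions)
    for _ in range(iters):
        grad = np.zeros(n_actions)
        for w, l in prefs:
            p = 1.0 / (1.0 + np.exp(-(r[w] - r[l])))  # P(w preferred over l)
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        r += lr * grad / max(len(prefs), 1)
        r -= r.mean()  # rewards are identified only up to a shift; fix the gauge
    return r

def pessimistic_median_policy(all_prefs, n_actions, counts, beta=1.0):
    """Aggregate per-labeler MLEs with a coordinate-wise median, then act
    pessimistically by penalizing actions with little comparison coverage."""
    estimates = np.stack([fit_bt_mle(p, n_actions) for p in all_prefs])
    median_r = np.median(estimates, axis=0)        # robust to a minority of misreporters
    bonus = beta / np.sqrt(np.maximum(counts, 1))  # larger penalty for rarely compared actions
    pessimistic_r = median_r - bonus
    return int(np.argmax(pessimistic_r)), pessimistic_r

if __name__ == "__main__":
    # Three labelers compare 3 actions; the third misreports to push action 2.
    honest = [(0, 1), (0, 2), (1, 2)]
    prefs = [honest, honest, [(2, 0), (2, 1), (2, 0)]]
    counts = np.array([4.0, 3.0, 4.0])  # how often each action appears in comparisons
    action, scores = pessimistic_median_policy(prefs, n_actions=3, counts=counts)
    print("pessimistic median scores:", np.round(scores, 3))
    print("chosen action:", action)  # the median preserves the honest majority's ranking
```

In this toy run, the single strategic labeler's inflated estimate for action 2 is discarded by the median, so the chosen action follows the two truthful labelers; with a naive mean aggregation the misreport would pull the estimate toward action 2, which is the failure mode the abstract attributes to non-strategyproof RLHF methods.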
Authors (4)
Thomas Kleine Buening
Jiarui Gan
Debmalya Mandal
Marta Kwiatkowska
Key Contributions
This work shows that existing RLHF algorithms are not strategyproof: labelers can strategically misreport feedback to manipulate the learned policy, causing significant misalignment with social welfare. It proves a fundamental trade-off between strategyproofness and policy alignment, and introduces the Pessimistic Median of MLEs algorithm, which is approximately strategyproof and converges to the optimal policy under suitable policy coverage assumptions as the number of labelers and samples grows.
Business Value
Crucial for developing trustworthy AI systems, especially LLMs, by ensuring that learned behavior reflects genuine human feedback rather than strategic manipulation by individual labelers, leading to safer and more reliable AI.