Abstract
Recent alignment techniques, such as reinforcement learning from human
feedback, have been widely adopted to align large language models with human
preferences by learning and leveraging reward models. In practice, these models
often exploit spurious correlations involving, e.g., response length,
discrimination, sycophancy, and conceptual bias, a problem that has
received increasing attention. In this work, we propose a principled framework
that mitigates these biases in reward models while preserving the underlying
factors that reflect intended preferences. We first provide a formulation of
the data-generating process, assuming that the observed data (e.g., text) is
generated from both spurious and non-spurious latent variables. We show that,
interestingly, these non-spurious latent variables can be theoretically
identified from data, regardless of whether a surrogate for the spurious latent
variables is available. This further inspires a practical method that uses
variational inference to recover these variables and leverages them to train
reward models. Experiments on synthetic and real-world datasets demonstrate
that our method effectively mitigates spurious correlation issues and yields
more robust reward models.
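The abstract describes recovering non-spurious latent variables via variational inference and training the reward model on them. The following is an illustrative sketch only, not the paper's actual method: it assumes a Gaussian amortized posterior whose latent space is split into hypothetical spurious and non-spurious blocks, with a linear reward head applied to the non-spurious block; all names and dimensions are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_mu, W_logvar):
    """Amortized Gaussian posterior q(z|x): returns mean and log-variance."""
    return x @ W_mu, x @ W_logvar

def reparameterize(mu, logvar, rng):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# Toy dimensions (hypothetical): 8-dim text features, latent split into
# 2 spurious dims and 2 non-spurious (preference-relevant) dims.
d_x, d_spur, d_pref = 8, 2, 2
d_z = d_spur + d_pref
W_mu = rng.standard_normal((d_x, d_z)) * 0.1
W_logvar = rng.standard_normal((d_x, d_z)) * 0.01
w_reward = rng.standard_normal(d_pref)  # reward head sees only the non-spurious block

x = rng.standard_normal((4, d_x))          # batch of 4 response feature vectors
mu, logvar = encode(x, W_mu, W_logvar)
z = reparameterize(mu, logvar, rng)
z_pref = z[:, d_spur:]                     # discard the spurious latent block
rewards = z_pref @ w_reward                # debiased scalar reward per response

# Closed-form KL(q(z|x) || N(0, I)) per example, the usual ELBO regularizer.
kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)
```

In a real implementation the encoder and reward head would be neural networks trained jointly with a reconstruction term and preference loss; the sketch only shows the latent split and the KL regularizer that variational inference contributes.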
Authors (5)
Ignavier Ng
Patrick Blöbaum
Siddharth Bhandari
Kun Zhang
Shiva Kasiviswanathan
Submitted
October 27, 2025
Key Contributions
This work proposes a principled framework using representation learning to mitigate biases (e.g., response length, discrimination, sycophancy) in reward models used for LLM alignment, while preserving underlying intended preferences. It theoretically identifies non-spurious latent variables from data, enabling a practical debiasing method.
Business Value
Improves the reliability and trustworthiness of LLMs by ensuring they align with genuine human preferences rather than superficial correlations, leading to safer and more useful AI applications.