Abstract
Human feedback can alter language models in unpredictable and undesirable
ways, as practitioners lack a clear understanding of what feedback data
encodes. While prior work studies preferences over certain attributes (e.g.,
length or sycophancy), automatically extracting relevant features without
pre-specifying hypotheses remains challenging. We introduce What's In My Human
Feedback? (WIMHF), a method to explain feedback data using sparse autoencoders.
WIMHF characterizes both (1) the preferences a dataset is capable of measuring
and (2) the preferences that the annotators actually express. Across 7
datasets, WIMHF identifies a small number of human-interpretable features that
account for the majority of the preference prediction signal achieved by
black-box models. These features reveal a wide diversity in what humans prefer,
and the role of dataset-level context: for example, users on Reddit prefer
informality and jokes, while annotators in HH-RLHF and PRISM disprefer them.
WIMHF also surfaces potentially unsafe preferences, such as that LMArena users
tend to vote against refusals, often in favor of toxic content. The learned
features enable effective data curation: re-labeling the harmful examples in
Arena yields large safety gains (+37%) with no cost to general performance.
They also allow fine-grained personalization: on the Community Alignment
dataset, we learn annotator-specific weights over subjective features that
improve preference prediction. WIMHF provides a human-centered analysis method
for practitioners to better understand and use preference data.
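As a rough illustration of the kind of pipeline the abstract describes, one could encode each chosen and rejected response with a sparse autoencoder, take the difference in feature activations, and fit a sparse linear model to see which interpretable features carry the preference signal. The sketch below is a minimal, hypothetical version of that idea; the `embed` and `sae.encode` interfaces are assumptions, not the authors' released code.

```python
# Hypothetical sketch of a WIMHF-style analysis (not the authors' implementation).
# Assumes: embed(text) returns a dense embedding, and sae.encode() maps it to a
# sparse, human-interpretable feature vector (as a NumPy array).
import numpy as np
from sklearn.linear_model import LogisticRegression

def feature_diff(sae, embed, chosen: str, rejected: str) -> np.ndarray:
    """Difference in sparse-feature activations between the two responses."""
    return sae.encode(embed(chosen)) - sae.encode(embed(rejected))

def fit_preference_model(sae, embed, pairs, labels):
    """Fit an L1-regularized model so only a few features carry the signal.

    pairs: list of (chosen, rejected) response strings
    labels: 1 if the annotator preferred `chosen`, else 0
    """
    X = np.stack([feature_diff(sae, embed, c, r) for c, r in pairs])
    y = np.asarray(labels)
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    clf.fit(X, y)
    # Nonzero weights point to the handful of interpretable features
    # (e.g. "informal, jokey tone" or "refusal") that explain the preferences.
    return clf
```

The L1 penalty keeps only a few features with nonzero weight, which mirrors the abstract's claim that a small number of interpretable features account for most of the predictive signal achieved by black-box models.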
Authors (4)
Rajiv Movva
Smitha Milli
Sewon Min
Emma Pierson
Submitted
October 30, 2025
Key Contributions
Introduces WIMHF (What's In My Human Feedback?), a method that uses sparse autoencoders to explain feedback data by characterizing both the preferences a dataset can measure and those annotators actually express. It identifies a small set of human-interpretable features that account for most of the preference prediction signal achieved by black-box models, revealing the diversity of human preferences and the role of dataset-level context.
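For the personalization result (annotator-specific weights over subjective features on Community Alignment), a hedged sketch is to reuse the same sparse-feature differences but fit one small weight vector per annotator; the data layout and helper below are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of annotator-specific preference weights
# (illustrative only; not the paper's Community Alignment training recipe).
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_per_annotator(diffs_by_annotator):
    """Fit one small model per annotator over the shared subjective features.

    diffs_by_annotator: {annotator_id: (X, y)} where each row of X is a
    sparse-feature difference (chosen minus rejected) and y is 1 if that
    annotator preferred the chosen response.
    """
    weights = {}
    for annotator, (X, y) in diffs_by_annotator.items():
        clf = LogisticRegression(penalty="l2", C=1.0)
        clf.fit(X, y)
        weights[annotator] = clf.coef_.ravel()  # this annotator's feature weights
    return weights
```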
Business Value
Enables more effective and targeted fine-tuning of LLMs by understanding the underlying preferences driving human feedback, leading to models that better align with desired behaviors and user expectations.