Abstract
Human feedback can alter language models in unpredictable and undesirable
ways, as practitioners lack a clear understanding of what feedback data
encodes. While prior work studies preferences over certain attributes (e.g.,
length or sycophancy), automatically extracting relevant features without
pre-specifying hypotheses remains challenging. We introduce What's In My Human
Feedback? (WIMHF), a method to explain feedback data using sparse autoencoders.
WIMHF characterizes both (1) the preferences a dataset is capable of measuring
and (2) the preferences that the annotators actually express. Across 7
datasets, WIMHF identifies a small number of human-interpretable features that
account for the majority of the preference prediction signal achieved by
black-box models. These features reveal a wide diversity in what humans prefer,
and the role of dataset-level context: for example, users on Reddit prefer
informality and jokes, while annotators in HH-RLHF and PRISM disprefer them.
WIMHF also surfaces potentially unsafe preferences, such as that LMArena users
tend to vote against refusals, often in favor of toxic content. The learned
features enable effective data curation: re-labeling the harmful examples in
Arena yields large safety gains (+37%) with no cost to general performance.
They also allow fine-grained personalization: on the Community Alignment
dataset, we learn annotator-specific weights over subjective features that
improve preference prediction. WIMHF provides a human-centered analysis method
for practitioners to better understand and use preference data.
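As a rough illustration of the kind of pipeline the abstract describes, one could encode each chosen and rejected response with a sparse autoencoder, take the difference in feature activations, and fit a sparse linear model to see which interpretable features carry the preference signal. The sketch below is a minimal, hypothetical version of that idea; the `embed` and `sae.encode` interfaces are assumptions, not the authors' released code.

```python
# Hypothetical sketch of a WIMHF-style analysis (not the authors' implementation).
# Assumes: embed(text) returns a dense embedding, and sae.encode() maps it to a
# sparse, human-interpretable feature vector (as a NumPy array).
import numpy as np
from sklearn.linear_model import LogisticRegression

def feature_diff(sae, embed, chosen: str, rejected: str) -> np.ndarray:
    """Difference in sparse-feature activations between the two responses."""
    return sae.encode(embed(chosen)) - sae.encode(embed(rejected))

def fit_preference_model(sae, embed, pairs, labels):
    """Fit an L1-regularized model so only a few features carry the signal.

    pairs: list of (chosen, rejected) response strings
    labels: 1 if the annotator preferred `chosen`, else 0
    """
    X = np.stack([feature_diff(sae, embed, c, r) for c, r in pairs])
    y = np.asarray(labels)
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    clf.fit(X, y)
    # Nonzero weights point to the handful of interpretable features
    # (e.g. "informal, jokey tone" or "refusal") that explain the preferences.
    return clf
```

The L1 penalty keeps only a few features with nonzero weight, which mirrors the abstract's claim that a small number of interpretable features account for most of the predictive signal achieved by black-box models.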
Authors (4)
Rajiv Movva
Smitha Milli
Sewon Min
Emma Pierson
Submitted
October 30, 2025
Key Contributions
Introduces WIMHF (What's In My Human Feedback?), a method that uses sparse autoencoders to explain feedback data by characterizing both the preferences a dataset can measure and those annotators actually express. It identifies a small set of human-interpretable features that account for most of the preference prediction signal achieved by black-box models, revealing the diversity of human preferences and the role of dataset-level context.
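For the personalization result (annotator-specific weights over subjective features on Community Alignment), a hedged sketch is to reuse the same sparse-feature differences but fit one small weight vector per annotator; the data layout and helper below are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of annotator-specific preference weights
# (illustrative only; not the paper's Community Alignment training recipe).
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_per_annotator(diffs_by_annotator):
    """Fit one small model per annotator over the shared subjective features.

    diffs_by_annotator: {annotator_id: (X, y)} where each row of X is a
    sparse-feature difference (chosen minus rejected) and y is 1 if that
    annotator preferred the chosen response.
    """
    weights = {}
    for annotator, (X, y) in diffs_by_annotator.items():
        clf = LogisticRegression(penalty="l2", C=1.0)
        clf.fit(X, y)
        weights[annotator] = clf.coef_.ravel()  # this annotator's feature weights
    return weights
```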
Business Value
Enables more effective and targeted fine-tuning of LLMs by understanding the underlying preferences driving human feedback, leading to models that better align with desired behaviors and user expectations.