
Compositional Bias Control in Large Language Models: Preference Learning Fails, Supervision Succeeds

Abstract

Large Language Models (LLMs) still produce gender-stereotyped language, even in occupation-neutral contexts, reflecting deep societal biases (Rudinger et al., 2018). To address this, prior work has proposed prompting, constrained decoding (Dathathri et al., 2020; Zhou et al., 2024), post-processing, and fine-tuning-based alignment (Rafailov et al., 2023; Ravfogel et al., 2022), but the comparative efficacy and learning dynamics of these methods remain poorly understood. We report a comparative analysis of six control techniques for bias mitigation: prompt-only, generate-and-filter, DFA-based Ctrl-G decoding, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Iterative Nullspace Projection (INLP). We evaluate each method on a compositional constraint task: generating sentences that contain at least one agentic and one communal descriptor for each of twenty Winogender-derived occupations. We quantify the trade-off between control strength and naturalness through evaluations of constraint compliance, lexical diversity, and fluency. Our results reveal sharp contrasts among the methods: SFT achieves 99.87 ± 0.15% compliance with high lexical diversity, while DPO, despite similar training stability, fails at 4.53 ± 0.82%. Ctrl-G guarantees perfect compliance, but at the cost of severely reduced fluency and diversity. Preference-based learning is fundamentally different: it cannot satisfy compositional constraints because binary preference signals encode a ranking, not a logical conjunction. Only explicit positive supervision enables mitigation of compositional biases; preference-based alignment fails to generalize logical structures, underscoring the limitations of preference learning and the necessity of explicit supervision for fair and fluent controlled generation.
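
To make the task and metric concrete, the sketch below shows one way compliance with the compositional constraint could be scored: a generation counts as compliant only if it contains at least one agentic and at least one communal descriptor. The word lists and the token-level matching rule are illustrative assumptions, not the paper's actual lexicons or evaluation code.

```python
# Hypothetical compliance check for the compositional constraint described above.
# AGENTIC/COMMUNAL lists and the matching rule are illustrative assumptions;
# the paper's actual lexicons and scoring may differ.
import re

AGENTIC = {"assertive", "ambitious", "decisive", "confident", "independent"}
COMMUNAL = {"supportive", "caring", "cooperative", "warm", "helpful"}


def tokens(text: str) -> set[str]:
    """Lowercased word tokens with punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))


def is_compliant(sentence: str) -> bool:
    """True iff the sentence contains >=1 agentic AND >=1 communal descriptor."""
    t = tokens(sentence)
    return bool(t & AGENTIC) and bool(t & COMMUNAL)


def compliance_rate(generations: list[str]) -> float:
    """Fraction of generations satisfying the compositional constraint."""
    return sum(map(is_compliant, generations)) / max(len(generations), 1)


# Example: one compliant and one non-compliant generation for an occupation.
print(compliance_rate([
    "The engineer was decisive in meetings yet caring toward new teammates.",
    "The engineer finished the bridge design ahead of schedule.",
]))  # -> 0.5
```

A per-occupation breakdown over the twenty Winogender-derived occupations would simply apply the same check within each occupation's generations.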
Authors (1): Atij Mahesh
Submitted: October 24, 2025
arXiv Category: cs.CL

Key Contributions

This paper provides a comparative analysis of six bias control techniques for LLMs, demonstrating that preference learning (DPO) fails to satisfy compositional bias constraints, while supervised fine-tuning (SFT) achieves near-perfect compliance with high lexical diversity and hard-constrained Ctrl-G decoding guarantees compliance at the cost of fluency and diversity.
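
As a rough illustration of why the two training signals differ (the record fields below are hypothetical, not the paper's data schema): SFT is shown complete target sentences that jointly satisfy both descriptor requirements, so the conjunction is demonstrated explicitly, whereas DPO only observes that one completion is preferred over another, a binary ranking that by itself cannot express "agentic AND communal".

```python
# Illustrative (hypothetical) training-record formats for the two paradigms.

# SFT: explicit positive supervision -- the target itself satisfies the
# conjunction, so the model is directly shown what a compliant sentence looks like.
sft_record = {
    "prompt": "Write one sentence about the nurse.",
    "target": "The nurse was assertive about the schedule and warm with every patient.",
}

# DPO: a binary preference between two completions. The training signal says
# only "chosen > rejected"; it never states *why*, so a constraint requiring
# both descriptor types at once is never represented explicitly.
dpo_record = {
    "prompt": "Write one sentence about the nurse.",
    "chosen": "The nurse was assertive and warm with every patient.",
    "rejected": "The nurse was warm with every patient.",  # satisfies only one half
}
```

Under this framing, the paper's finding that DPO reaches roughly 4.5% compliance while SFT reaches roughly 99.9% is consistent with its claim that ranking signals do not transmit logical structure.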

Business Value

Helps developers choose the most effective methods for mitigating harmful biases in LLM outputs, leading to safer and more ethical AI applications.