
Compositional Bias Control in Large Language Models: Preference Learning Fails, Supervision Succeeds

Abstract

Large Language Models (LLMs) still produce gender-stereotyped language, even in occupation-neutral contexts, reflecting deep societal biases (Rudinger et al., 2018). To address this, prior work has proposed prompting, constrained decoding (Dathathri et al., 2020; Zhou et al., 2024), post-processing, and fine-tuning-based alignment (Rafailov et al., 2023; Ravfogel et al., 2022), but the comparative efficacy and learning dynamics of these methods remain poorly understood. We report a comparative analysis of six control techniques for bias mitigation: prompt-only, generate-and-filter, DFA-based Ctrl-G decoding, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Iterative Nullspace Projection (INLP). We evaluate each method on a compositional constraint task: generating sentences that contain at least one agentic and one communal descriptor for each of twenty Winogender-derived occupations. We quantify the trade-off between control strength and naturalness through evaluations of constraint compliance, lexical diversity, and fluency. Our results reveal sharp contrasts among the methods: SFT achieves 99.87 ± 0.15% compliance with high lexical diversity, while DPO, despite similar training stability, fails at 4.53 ± 0.82%. Ctrl-G guarantees perfect compliance, but at the cost of severely reduced fluency and diversity. Preference-based learning is fundamentally different: it cannot satisfy compositional constraints because binary preference signals encode a ranking, not a logical conjunction. Only explicit positive supervision enables mitigation of compositional biases; preference-based alignment fails to generalize logical structures, underscoring the limitations of preference learning and the necessity of explicit supervision for fair and fluent controlled generation.
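
To make the task and metric concrete, the sketch below shows one way compliance with the compositional constraint could be scored: a generation counts as compliant only if it contains at least one agentic and at least one communal descriptor. The word lists and the token-level matching rule are illustrative assumptions, not the paper's actual lexicons or evaluation code.

```python
# Hypothetical compliance check for the compositional constraint described above.
# AGENTIC/COMMUNAL lists and the matching rule are illustrative assumptions;
# the paper's actual lexicons and scoring may differ.
import re

AGENTIC = {"assertive", "ambitious", "decisive", "confident", "independent"}
COMMUNAL = {"supportive", "caring", "cooperative", "warm", "helpful"}


def tokens(text: str) -> set[str]:
    """Lowercased word tokens with punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))


def is_compliant(sentence: str) -> bool:
    """True iff the sentence contains >=1 agentic AND >=1 communal descriptor."""
    t = tokens(sentence)
    return bool(t & AGENTIC) and bool(t & COMMUNAL)


def compliance_rate(generations: list[str]) -> float:
    """Fraction of generations satisfying the compositional constraint."""
    return sum(map(is_compliant, generations)) / max(len(generations), 1)


# Example: one compliant and one non-compliant generation for an occupation.
print(compliance_rate([
    "The engineer was decisive in meetings yet caring toward new teammates.",
    "The engineer finished the bridge design ahead of schedule.",
]))  # -> 0.5
```

A per-occupation breakdown over the twenty Winogender-derived occupations would simply apply the same check within each occupation's generations.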
Authors (1): Atij Mahesh
Submitted: October 24, 2025
arXiv Category: cs.CL

Key Contributions

This paper provides a comparative analysis of six bias control techniques for LLMs, demonstrating that preference learning (DPO) fails to satisfy compositional bias constraints, while supervised fine-tuning (SFT) achieves near-perfect compliance with high lexical diversity and hard-constrained Ctrl-G decoding guarantees compliance at the cost of fluency and diversity.
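
As a rough illustration of why the two training signals differ (the record fields below are hypothetical, not the paper's data schema): SFT is shown complete target sentences that jointly satisfy both descriptor requirements, so the conjunction is demonstrated explicitly, whereas DPO only observes that one completion is preferred over another, a binary ranking that by itself cannot express "agentic AND communal".

```python
# Illustrative (hypothetical) training-record formats for the two paradigms.

# SFT: explicit positive supervision -- the target itself satisfies the
# conjunction, so the model is directly shown what a compliant sentence looks like.
sft_record = {
    "prompt": "Write one sentence about the nurse.",
    "target": "The nurse was assertive about the schedule and warm with every patient.",
}

# DPO: a binary preference between two completions. The training signal says
# only "chosen > rejected"; it never states *why*, so a constraint requiring
# both descriptor types at once is never represented explicitly.
dpo_record = {
    "prompt": "Write one sentence about the nurse.",
    "chosen": "The nurse was assertive and warm with every patient.",
    "rejected": "The nurse was warm with every patient.",  # satisfies only one half
}
```

Under this framing, the paper's finding that DPO reaches roughly 4.5% compliance while SFT reaches roughly 99.9% is consistent with its claim that ranking signals do not transmit logical structure.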

Business Value

Helps developers choose the most effective methods for mitigating harmful biases in LLM outputs, leading to safer and more ethical AI applications.