
Teaching Models to Understand (but not Generate) High-risk Data

📄 Abstract

Language model developers typically filter out high-risk content -- such as toxic or copyrighted text -- from their pre-training data to prevent models from generating similar outputs. However, removing such data altogether limits models' ability to recognize and appropriately respond to harmful or sensitive content. In this paper, we introduce Selective Loss to Understand but Not Generate (SLUNG), a pre-training paradigm through which models learn to understand high-risk data without learning to generate it. Instead of uniformly applying the next-token prediction loss, SLUNG selectively avoids incentivizing the generation of high-risk tokens while ensuring they remain within the model's context window. As the model learns to predict low-risk tokens that follow high-risk ones, it is forced to understand the high-risk content. Through our experiments, we show that SLUNG consistently improves models' understanding of high-risk data (e.g., ability to recognize toxic content) without increasing its generation (e.g., toxicity of model responses). Overall, our SLUNG paradigm enables models to benefit from high-risk text that would otherwise be filtered out.
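
The selective loss described above can be illustrated in code. What follows is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the function name slung_loss and the high_risk_mask input are invented here for illustration, and how tokens are flagged as high-risk is left open. The idea is that the next-token loss is applied only where the target token is low-risk, so high-risk tokens remain in the context window (and must be understood) but are never directly incentivized as outputs.

import torch
import torch.nn.functional as F

def slung_loss(logits, input_ids, high_risk_mask):
    # logits:         (batch, seq_len, vocab) from a causal language model
    # input_ids:      (batch, seq_len) token ids of the training text
    # high_risk_mask: (batch, seq_len) 1.0 where a token is high-risk, else 0.0
    #                 (how tokens are flagged is an assumption; see usage note)
    # Standard causal shift: position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_targets = input_ids[:, 1:]
    shift_risk = high_risk_mask[:, 1:].float()  # risk label of each *target* token

    per_token_loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
        reduction="none",
    ).view_as(shift_risk)

    # Zero the loss wherever the target token is high-risk. Those tokens still
    # sit in the context, so predicting the low-risk tokens that follow them
    # forces the model to encode (understand) the high-risk content.
    keep = 1.0 - shift_risk
    return (per_token_loss * keep).sum() / keep.sum().clamp(min=1.0)

In a training step this sketch would replace the usual mean cross-entropy over all positions; averaging only over the kept (low-risk-target) positions keeps the loss scale comparable to standard pre-training.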
Authors (5): Ryan Wang, Matthew Finlayson, Luca Soldaini, Swabha Swayamdipta, Robin Jia
Submitted: May 5, 2025
arXiv Category: cs.CL

Key Contributions

SLUNG introduces a novel pre-training paradigm that enables language models to understand high-risk data (e.g., toxic or copyrighted text) without learning to generate it. By selectively modifying the next-token prediction loss, SLUNG avoids rewarding the generation of high-risk tokens while still requiring the model to comprehend them in order to predict the low-risk tokens that follow, which improves the model's ability to recognize and respond appropriately to harmful content.
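
How tokens are labeled as high-risk is not specified in this summary. As a purely hypothetical illustration, the helper below turns character-level span annotations (e.g., from a toxicity classifier or a copyright filter, both assumptions) into the token-level high_risk_mask consumed by the loss sketch above, using the offset mappings of a Hugging Face fast tokenizer; the function name and example spans are invented for illustration.

import torch
from transformers import AutoTokenizer

def build_high_risk_mask(text, risky_spans, tokenizer):
    # text:        raw training string
    # risky_spans: list of (start, end) character offsets flagged as high-risk
    #              (the flagging method is an assumption, e.g. a toxicity classifier)
    enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt")
    offsets = enc["offset_mapping"][0]  # (seq_len, 2) character range per token
    mask = torch.zeros(offsets.size(0))
    for i, (tok_start, tok_end) in enumerate(offsets.tolist()):
        # A token is high-risk if its character span overlaps any flagged span.
        if any(tok_start < span_end and tok_end > span_start
               for span_start, span_end in risky_spans):
            mask[i] = 1.0
    return enc["input_ids"], mask.unsqueeze(0)

# Illustrative usage with an assumed tokenizer and flagged span:
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# input_ids, high_risk_mask = build_high_risk_mask(
#     "a sentence containing a toxic phrase", [(24, 29)], tokenizer)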

Business Value

SLUNG supports the development of safer, more responsible AI models that can better identify and handle sensitive or harmful content, a capability that is crucial for content moderation, platforms hosting user-generated content, and ethical AI deployment.