Abstract
Language model developers typically filter out high-risk content -- such as toxic or copyrighted text -- from their pre-training data to prevent models from generating similar outputs. However, removing such data altogether limits models' ability to recognize and appropriately respond to harmful or sensitive content. In this paper, we introduce Selective Loss to Understand but Not Generate (SLUNG), a pre-training paradigm through which models learn to understand high-risk data without learning to generate it. Instead of uniformly applying the next-token prediction loss, SLUNG selectively avoids incentivizing the generation of high-risk tokens while ensuring they remain within the model's context window. As the model learns to predict low-risk tokens that follow high-risk ones, it is forced to understand the high-risk content. Through our experiments, we show that SLUNG consistently improves models' understanding of high-risk data (e.g., ability to recognize toxic content) without increasing its generation (e.g., toxicity of model responses). Overall, our SLUNG paradigm enables models to benefit from high-risk text that would otherwise be filtered out.
Authors (5)
Ryan Wang
Matthew Finlayson
Luca Soldaini
Swabha Swayamdipta
Robin Jia
Key Contributions
SLUNG introduces a novel pre-training paradigm that enables language models to understand high-risk data (e.g., toxic or copyrighted text) without learning to generate it. By selectively modifying the next-token prediction loss, SLUNG avoids incentivizing the generation of high-risk tokens while still requiring the model to comprehend them in order to predict the low-risk tokens that follow. This improves the model's ability to recognize and respond to harmful content, as illustrated in the sketch below.
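To make the mechanism concrete, here is a minimal sketch of selective loss masking in the spirit of SLUNG, written in PyTorch. The function name, the per-token risk mask, and the masking policy are illustrative assumptions rather than the authors' released implementation; the key point is that high-risk tokens contribute no loss as prediction targets but stay in the input, so later low-risk predictions still depend on them.

```python
# Illustrative sketch of SLUNG-style selective loss masking (not the authors' code).
import torch
import torch.nn.functional as F

def slung_style_loss(logits, input_ids, high_risk_mask):
    """Next-token loss that skips high-risk targets but keeps them in context.

    logits:         (batch, seq_len, vocab) model outputs
    input_ids:      (batch, seq_len) token ids; high-risk tokens remain as inputs
    high_risk_mask: (batch, seq_len) bool, True where a token is high-risk
    """
    # Standard next-token shift: position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_targets = input_ids[:, 1:]
    shift_risk = high_risk_mask[:, 1:]

    # Per-token cross-entropy, no reduction yet.
    per_token = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
        reduction="none",
    ).view_as(shift_targets).float()

    # Zero the loss wherever the *target* token is high-risk, so the model is
    # never rewarded for generating it, while the token still appears as input
    # context for subsequent low-risk predictions.
    keep = (~shift_risk).float()
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)
```

How spans get labeled as high-risk (e.g., by a toxicity classifier or copyright matching) is a separate design choice that this sketch does not prescribe.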
Business Value
Develops safer and more responsible AI models that can better identify and handle sensitive or harmful content, crucial for applications dealing with user-generated content, content moderation, and ethical AI deployment.