Abstract
We characterize how memorization is represented in transformer models and show that it can be disentangled in the weights of both language models (LMs) and vision transformers (ViTs) using a decomposition based on loss landscape curvature. This insight builds on prior theoretical and empirical work showing that the loss curvature at memorized training points is much sharper than at non-memorized points, so ordering weight components from high to low curvature can reveal the distinction without explicit labels. This motivates a weight editing procedure that suppresses recitation of untargeted memorized data far more effectively than a recent unlearning method (BalancedSubnet), while maintaining lower perplexity. Since the curvature basis has a natural interpretation in terms of shared structure in model weights, we extensively analyze the effect of the editing procedure on downstream tasks in LMs, and find that fact retrieval and arithmetic are specifically and consistently harmed, even though open-book fact retrieval and general logical reasoning are preserved. We posit that these tasks rely heavily on specialized directions in weight space rather than general-purpose mechanisms, regardless of whether the individual datapoints involved are memorized. We support this by showing a correspondence between how strongly task data activates the low-curvature components we edit out and the drop in task performance after the edit. Our work enhances the understanding of memorization in neural networks, with practical applications towards removing it, and provides evidence for idiosyncratic, narrowly-used structures involved in solving tasks like math and fact retrieval.
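As a rough illustration of the idea described above (decompose weights, order components by curvature, and edit out the low-curvature ones), the sketch below scores rank-1 components of a single weight matrix by a simple curvature proxy and removes the lowest-scoring fraction. This is a minimal sketch under stated assumptions, not the paper's procedure: the plain SVD decomposition, the empirical-Fisher-diagonal curvature proxy, the scoring and thresholding rule, and all function and module names are illustrative choices, since the abstract does not specify the actual method.

```python
# Toy sketch of curvature-ordered weight editing (illustrative assumptions only).
import torch


def fisher_diagonal(model, loss_fn, data_loader, device="cpu"):
    """Curvature proxy: diagonal of the empirical Fisher information,
    i.e. the mean squared gradient per parameter over a data stream.
    Assumes a standard (inputs, targets) loader and a loss_fn(outputs, targets)."""
    model.to(device)
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for inputs, targets in data_loader:
        model.zero_grad()
        loss = loss_fn(model(inputs.to(device)), targets.to(device))
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: f / max(n_batches, 1) for n, f in fisher.items()}


@torch.no_grad()
def edit_low_curvature_components(weight, fisher, keep_fraction=0.9):
    """Decompose a weight matrix into rank-1 SVD components, score each by the
    curvature mass it overlaps with, and keep only the highest-curvature
    fraction (a hypothetical editing rule for illustration)."""
    u, s, vh = torch.linalg.svd(weight, full_matrices=False)
    components = [s[i] * torch.outer(u[:, i], vh[i, :]) for i in range(s.shape[0])]
    scores = torch.stack([(c ** 2 * fisher).sum() for c in components])
    k = max(1, int(keep_fraction * len(components)))
    keep = torch.topk(scores, k).indices
    return torch.stack([components[i] for i in keep.tolist()]).sum(dim=0)


# Example usage (module path and parameter name are placeholders):
# fisher = fisher_diagonal(model, torch.nn.functional.cross_entropy, loader)
# layer = model.transformer.h[0].mlp.c_fc
# layer.weight.data = edit_low_curvature_components(
#     layer.weight.data, fisher["transformer.h.0.mlp.c_fc.weight"]
# )
```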
Authors (4)
Jack Merullo
Srihita Vatsavaya
Lucius Bushnaq
Owen Lewis
Submitted
October 28, 2025