Research paper for AI safety researchers, LLM developers, ML engineers, and security professionals

Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization


Abstract

Language models can retain dangerous knowledge and skills even after extensive safety fine-tuning, posing both misuse and misalignment risks. Recent studies show that even specialized unlearning methods can be easily reversed. To address this, we systematically evaluate many existing and novel components of unlearning methods and identify ones crucial for irreversible unlearning. We introduce Disruption Masking, a technique in which we only allow updating weights where the signs of the unlearning gradient and the retaining gradient are the same. This ensures all updates are non-disruptive. Additionally, we identify the need for normalizing the unlearning gradients, and also confirm the usefulness of meta-learning. We combine these insights into MUDMAN (Meta-Unlearning with Disruption Masking and Normalization) and validate its effectiveness at preventing the recovery of dangerous capabilities. MUDMAN outperforms the prior TAR method by 40%, setting a new state-of-the-art for robust unlearning.
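The abstract credits meta-learning as useful but does not describe the procedure. The sketch below assumes a first-order scheme in the spirit of adversarial meta-unlearning: fork the weights, let a simulated attacker relearn the forget set on the fork for a few steps, compute the unlearning gradient at the relearned fork, and apply it to the original weights. The function names, the toy quadratic losses, and the step sizes are all hypothetical, and a plain gradient step stands in for MUDMAN's masked, normalized update.

```python
from typing import Callable, List

Vector = List[float]
GradFn = Callable[[Vector], Vector]


def meta_unlearning_step(
    params: Vector,
    relearn_grad: GradFn,  # hypothetical: gradient an attacker descends to relearn the forget set
    unlearn_grad: GradFn,  # hypothetical: gradient of the unlearning loss (descended to forget)
    lr: float,
    adv_lr: float,
    adv_steps: int,
) -> Vector:
    """One first-order meta-unlearning step (illustrative sketch, not the paper's code)."""
    # 1. Fork the weights and simulate a short relearning attack on the fork only.
    fork = list(params)
    for _ in range(adv_steps):
        fork = [w - adv_lr * g for w, g in zip(fork, relearn_grad(fork))]

    # 2. Evaluate the unlearning gradient at the attacked fork, so the update
    #    targets what the attack would otherwise restore.
    g_unlearn = unlearn_grad(fork)

    # 3. Apply that gradient to the ORIGINAL weights; the fork is discarded.
    return [w - lr * g for w, g in zip(params, g_unlearn)]


# Toy setup (hypothetical): the "dangerous knowledge" sits at the point (1, 1).
# Relearning minimizes L_f(w) = sum((w_i - 1)^2); unlearning descends -L_f.
def toy_relearn_grad(w: Vector) -> Vector:
    return [2.0 * (wi - 1.0) for wi in w]


def toy_unlearn_grad(w: Vector) -> Vector:
    return [-2.0 * (wi - 1.0) for wi in w]


if __name__ == "__main__":
    new_params = meta_unlearning_step(
        [0.0, 0.0], toy_relearn_grad, toy_unlearn_grad, lr=0.1, adv_lr=0.25, adv_steps=1
    )
    print(new_params)  # [-0.1, -0.1]: the weights move away from (1, 1)
```

Setting `adv_steps=0` reduces the step to ordinary gradient descent on the unlearning loss at the original weights, which makes the effect of the simulated attack easy to isolate.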

Authors (4)

Filip Sondej

Yushi Yang

Mikołaj Kniejski

Marcel Windys

Submitted

June 14, 2025

arXiv Category

cs.LG


Key Contributions

This paper introduces MUDMAN (Meta-Unlearning with Disruption Masking and Normalization), a method for robust, hard-to-reverse LLM unlearning. MUDMAN combines Disruption Masking (updating a weight only where the unlearning and retaining gradients have the same sign), normalization of the unlearning gradients, and meta-learning. It outperforms the prior TAR method by 40% at preventing the recovery of dangerous capabilities.
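Disruption Masking and normalization can be illustrated on a flat parameter vector. This is a minimal sketch under stated assumptions, not the authors' implementation: the unlearning gradient is normalized to unit L2 norm over the whole vector (the paper may normalize per tensor or in a different order), and then every coordinate whose unlearning and retaining gradients do not share a sign is zeroed. A descent step of -lr * u_i on a coordinate where u_i and r_i share a sign changes the retain loss by about -lr * u_i * r_i < 0, so no kept coordinate increases the retain loss to first order, which is the sense in which the updates are non-disruptive. The function name `mudman_style_update` is hypothetical.

```python
import math
from typing import List

Vector = List[float]


def mudman_style_update(
    params: Vector, unlearn_grad: Vector, retain_grad: Vector, lr: float
) -> Vector:
    """Normalized, disruption-masked unlearning step (illustrative sketch)."""
    # Normalize the unlearning gradient to unit L2 norm (a global norm is an assumption).
    norm = math.sqrt(sum(g * g for g in unlearn_grad))
    if norm == 0.0:
        return list(params)
    unit = [g / norm for g in unlearn_grad]

    # Disruption Masking: keep a coordinate only if both gradients have the same
    # strict sign (product > 0); coordinates where either gradient is zero are dropped.
    masked = [u if u * r > 0.0 else 0.0 for u, r in zip(unit, retain_grad)]

    # Gradient descent on the unlearning loss using only the surviving coordinates.
    return [p - lr * m for p, m in zip(params, masked)]


if __name__ == "__main__":
    params = [1.0, 1.0]
    unlearn = [3.0, 4.0]  # L2 norm 5, so the unit vector is [0.6, 0.8]
    retain = [1.0, -2.0]  # second coordinate disagrees in sign, so it is masked
    new = mudman_style_update(params, unlearn, retain, lr=0.5)
    print(new)  # first weight drops by 0.5 * 0.6; second weight is unchanged
```

Because the mask depends only on signs, a coordinate survives even when its retaining gradient is tiny, as long as the two directions agree; normalization keeps the step size fixed regardless of how large the raw unlearning gradient is.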

Business Value

Improves the safety and trustworthiness of LLMs by offering a more robust way to remove sensitive or harmful knowledge so that it cannot easily be recovered, which supports responsible AI deployment.

Paper Metadata

Innovation Type

Algorithmic improvement

Deployment Feasibility

High, as it's a technique applied during or after model training to improve safety.

Limitations Addressed

Reversibility of existing unlearning methods, Retention of dangerous knowledge/skills in LLMs, Misuse and misalignment risks

Performance Gains

40% improvement over the prior TAR method

Technical Tags

LLM Unlearning, MUDMAN, Meta-Unlearning, Disruption Masking, Normalization, irreversible unlearning, dangerous knowledge, misalignment risks, gradient updates, state-of-the-art

Research Topics

AI safety, LLM alignment, Unlearning techniques, Model security, Meta-learning

Methods & Architectures

Disruption Masking, Gradient normalization, Meta-learning, MUDMAN framework, Large Language Models (LLMs)

Applications & Tasks

AI Safety, Machine Learning Security, LLM Development, Removing unwanted knowledge, Preventing model reversal, Ensuring model safety, Unlearning specific data/capabilities, Securing LLMs against misuse, Mitigating alignment risks

Related Fields

AI Safety, Machine Learning, Deep Learning, Privacy, Security

Keywords

LLM unlearning, AI safety, alignment, MUDMAN, Disruption Masking, normalization, meta-learning, irreversible, dangerous knowledge, misinformation, model security


Commercial Potential

Potential Products

Secure LLM development platforms, AI safety auditing tools, Data privacy solutions for LLMs

Target Industries

Technology, Finance, Healthcare, Government

Use Case Examples

Removing proprietary information from a deployed LLM, Ensuring an LLM cannot generate harmful content, Complying with data removal requests for AI models

Competitive Edge

Sets a new state of the art for robust LLM unlearning, outperforming the prior TAR method by 40% at preventing the recovery of dangerous capabilities.

Resource Requirements

Compute Needs

Moderate to High, depending on the size of the LLM being unlearned.

Data Requirements

Requires access to the model weights, data representing the knowledge to be unlearned, and data for the capabilities to be retained (Disruption Masking relies on a retaining gradient).

Scalability

The method is designed to be applicable to large language models.

Regulatory Considerations

Data privacy regulations (e.g., GDPR, CCPA)

Production Readiness

Maturity Level

Research
