Abstract
Large language models (LLMs) often acquire knowledge during pretraining that
is undesirable in downstream deployments, e.g., sensitive information or
copyrighted content. Existing approaches for removing such knowledge rely on
fine-tuning, training low-rank adapters, or fact-level editing, but these are
either too coarse, too shallow, or ineffective. In this work, we propose PISCES
(Precise In-parameter Suppression for Concept EraSure), a novel framework for
precisely erasing entire concepts from model parameters by directly editing
directions that encode them in parameter space. PISCES uses a disentangler
model to decompose MLP vectors into interpretable features, identifies those
associated with a target concept using automated interpretability techniques,
and removes them from model parameters. Experiments on Gemma 2 and Llama 3.1
over various concepts show that PISCES achieves modest gains in efficacy over
leading erasure methods, reducing accuracy on the target concept to as low as
7.7%, while dramatically improving erasure specificity (by up to 31%) and
robustness (by up to 38%). Overall, these results demonstrate that
feature-based in-parameter editing enables a more precise and reliable approach
for removing conceptual knowledge in language models.
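
To make the pipeline concrete, here is a minimal sketch of its first two steps: decomposing MLP value vectors with a disentangler and flagging concept-associated features. Everything in it is an assumption for illustration, not the authors' released implementation: the disentangler is modeled as a sparse-autoencoder-style decoder with unit-norm feature directions, and the automated-interpretability step is stubbed with a cosine-similarity score against a hypothetical concept probe vector.

```python
# Sketch of decomposition + feature identification under stated assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_mlp, n_features = 64, 256, 512

# MLP down-projection: each column is a "value vector" the MLP can write
# into the residual stream.
W_down = torch.randn(d_model, d_mlp)

# Disentangler decoder: one unit-norm interpretable direction per feature
# (an SAE-style decoder is one plausible choice; an assumption here).
decoder = F.normalize(torch.randn(n_features, d_model), dim=-1)

# Decompose every MLP value vector into per-feature coefficients.
coeffs = decoder @ W_down                      # (n_features, d_mlp)

# Hypothetical concept probe; in the paper this role is played by
# automated interpretability over feature descriptions.
concept_probe = F.normalize(torch.randn(d_model), dim=-1)
relevance = decoder @ concept_probe            # per-feature concept score

# Flag the features most associated with the target concept.
concept_feature_ids = torch.topk(relevance, k=3).indices
print("flagged features:", concept_feature_ids.tolist())
print("dominant feature of first value vector:", coeffs[:, 0].abs().argmax().item())
```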
Authors
Yoav Gur-Arieh
Clara Suslik
Yihuai Hong
Fazl Barez
Mor Geva
Key Contributions
PISCES (Precise In-parameter Suppression for Concept EraSure) is a novel framework for precisely erasing entire concepts from LLM parameters by editing the directions that encode them. It uses a disentangler model to decompose MLP vectors into interpretable features, identifies those associated with a target concept, and removes them from the parameters, achieving modest gains in efficacy over leading erasure methods while substantially improving specificity and robustness. A sketch of the removal step follows below.
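
The matching sketch of the final step, the in-parameter edit. The edit rule shown here, projecting the flagged feature directions out of the MLP down-projection columns, is an assumed mechanism consistent with the description above, not the paper's exact update.

```python
# Sketch of the removal step: project the flagged concept-feature
# directions out of the MLP down-projection (assumed edit rule).
import torch

torch.manual_seed(0)
d_model, d_mlp = 64, 256
W_down = torch.randn(d_model, d_mlp)
# Stand-in for the flagged feature directions from the previous sketch.
directions = torch.nn.functional.normalize(torch.randn(3, d_model), dim=-1)

def erase(W, dirs):
    """Remove the component of every value vector (column of W) that
    lies in the span of the concept-feature directions."""
    Q, _ = torch.linalg.qr(dirs.T)   # orthonormal basis of the concept subspace
    return W - Q @ (Q.T @ W)         # project the subspace out of each column

W_edited = erase(W_down, directions)

# Sanity check: edited vectors carry no remaining concept component, while
# everything orthogonal to the flagged directions is untouched -- the
# property behind the specificity gains the abstract reports.
print("residual concept alignment:", (directions @ W_edited).abs().max().item())
```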
Business Value
Enables organizations to deploy LLMs more safely and responsibly by removing sensitive information, copyrighted material, or biased knowledge, thereby mitigating risks and ensuring compliance.