Improving LLM Safety Alignment with Dual-Objective Optimization

📄 Abstract

Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. Direct preference optimization (DPO), a widely deployed alignment method, exhibits limitations in both experimental and theoretical contexts as its loss function proves suboptimal for refusal learning. Through gradient-based analysis, we identify these shortcomings and propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge. This approach significantly increases LLM robustness against a wide range of jailbreak attacks, including prefilling, suffix, and multi-turn attacks across both in-distribution and out-of-distribution scenarios. Furthermore, we introduce a method to emphasize critical refusal tokens by incorporating a reward-based token-level weighting mechanism for refusal learning, which further improves the robustness against adversarial exploits. Our research also suggests that robustness to jailbreak attacks is correlated with token distribution shifts in the training process and internal representations of refusal and harmful tokens, offering valuable directions for future research in LLM safety alignment. The code is available at https://github.com/wicai24/DOOR-Alignment
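
To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows the standard DPO loss next to one possible "disentangled" objective with a separate refusal term and a separate unlearning term. The function names, the NPO-style form of the unlearning term, and the `unlearn_weight` parameter are assumptions made for illustration; the abstract does not specify them. Inputs are summed sequence log-probabilities under the policy and a frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_safe, logp_harm, ref_safe, ref_harm, beta=0.1):
    """Standard DPO loss for one (safe, harmful) response pair.

    The loss depends only on the margin between the two log-ratios, so it
    can shrink by pushing the harmful response down without making the
    refusal itself any more likely.
    """
    margin = beta * ((logp_safe - ref_safe) - (logp_harm - ref_harm))
    return -log_sigmoid(margin)


def dual_objective_loss(logp_safe, logp_harm, ref_harm, n_safe_tokens,
                        beta=0.1, unlearn_weight=1.0):
    """Hypothetical two-term objective (an assumption, not the paper's code).

    refusal: per-token negative log-likelihood of the safe response. To
        mimic "refusal after partial unsafe generation", logp_safe would be
        computed with the prompt followed by a harmful prefix as context.
    unlearn: an NPO-style penalty that lowers the harmful response's
        likelihood relative to the reference model.
    """
    refusal = -logp_safe / n_safe_tokens
    unlearn = -(2.0 / beta) * log_sigmoid(-beta * (logp_harm - ref_harm))
    return refusal + unlearn_weight * unlearn
```

In this sketch the refusal term is driven directly by the likelihood of the safe response, whereas the DPO loss can be reduced by suppressing the harmful response alone. That is one way to read the abstract's claim that the DPO loss is suboptimal for refusal learning.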

Authors (7)

Xuandong Zhao

Will Cai

Tianneng Shi

David Huang

Licong Lin

Song Mei

+1 more

Submitted

March 5, 2025

arXiv Category

cs.CL

Key Contributions

Proposes a dual-objective optimization approach to improve LLM safety alignment, addressing shortcomings of DPO. It disentangles objectives into robust refusal training and targeted unlearning, significantly increasing robustness against various jailbreak attacks.

Business Value

Enhances the safety and reliability of LLM deployments, reducing risks associated with misuse and harmful outputs, which is critical for widespread adoption in sensitive applications.

Paper Metadata

Innovation Type

Novel Optimization Objective

Deployment Feasibility

High, as it's a training-time technique that can be integrated into existing LLM alignment pipelines.

Limitations Addressed

Addresses the vulnerability of existing safety alignment techniques (like DPO) to jailbreak attacks and their suboptimal performance in refusal learning.

Performance Gains

Significantly increases LLM robustness against a wide range of jailbreak attacks.

Technical Tags

LLM safety alignment • Jailbreak attacks • Dual-objective optimization • Direct Preference Optimization (DPO) • Refusal training • Unlearning harmful knowledge • Robustness • Token-level weighting • Prefilling attacks • Suffix attacks

Research Topics

AI Safety • LLM Alignment • Robustness • Adversarial Attacks • Machine Learning Ethics

Methods & Architectures

Dual-objective optimization • Gradient-based analysis • Robust refusal training • Targeted unlearning • Reward-based token-level weighting • Large Language Models (LLMs)

Applications & Tasks

AI Safety • Responsible AI • Natural Language Processing • Improving LLM Safety • Defending against Jailbreaks • Enhancing Refusal Capabilities • Increasing LLM robustness to jailbreak attacks • Improving refusal learning • Unlearning harmful knowledge from LLMs

Datasets & Benchmarks

Benchmarks

Wide range of jailbreak attacks (prefilling, suffix, multi-turn) • In-distribution and out-of-distribution scenarios

Robustness against jailbreak attacks • Refusal rate • Harmful content generation rate
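
The refusal rate and harmful content generation rate listed above can be computed along the following lines. This is only an illustrative sketch: the keyword list, the function names, and the prefix-matching rule are hypothetical stand-ins, and real evaluations typically rely on a trained judge model or human labels rather than string matching.

```python
# Hypothetical refusal openers; a real evaluation would use a judge model.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


def is_refusal(response: str) -> bool:
    """Crude check: does the response open with a known refusal phrase?"""
    return response.strip().lower().startswith(REFUSAL_MARKERS)


def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses classified as refusals."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


def harmful_generation_rate(harm_labels: list[bool]) -> float:
    """Fraction of responses an external judge labeled as harmful."""
    if not harm_labels:
        return 0.0
    return sum(harm_labels) / len(harm_labels)
```

Under a jailbreak attack, a robust model should keep the refusal rate high on harmful prompts while the harmful generation rate stays low. The two numbers are reported separately because a response can fail to refuse without actually being harmful.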

Related Fields

AI Safety • Machine Learning • Natural Language Processing • Cybersecurity • Ethics

Keywords

LLM safety • alignment • jailbreak • robustness • DPO • dual-objective optimization • refusal • unlearning • AI safety • adversarial attacks • responsible AI

Academic Context

#AI Safety • #LLM Alignment • #Robustness • #Adversarial Attacks • #Machine Learning Ethics

Technology Stack

ML Infrastructure

LLM training and alignment frameworks

Commercial Potential

Potential Products

Safer LLM models • Alignment toolkits • Security modules for LLM applications

Target Industries

Technology • AI Development • SaaS • Any industry deploying LLMs

Use Case Examples

Preventing LLMs from generating harmful instructions or misinformation despite adversarial prompts. Ensuring LLMs reliably refuse inappropriate requests.

Competitive Edge

Offers a more robust and theoretically grounded approach to LLM safety alignment compared to methods like standard DPO.

Market Opportunity

Critical need for safe and aligned AI systems.

Revenue Models

Licensing of improved alignment techniques • Safer LLM models

Resource Requirements

Compute Needs

High, requires significant computational resources for training and fine-tuning LLMs.

Data Requirements

Requires datasets for preference learning, potentially adversarial examples, and harmful content examples.

Deployment Constraints

Requires careful implementation of the dual-objective optimization and weighting mechanisms.
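
As a rough illustration of what the weighting mechanism involves, here is a minimal sketch of a token-weighted refusal loss. It assumes per-token log-probabilities of the refusal response and a per-token reward signal are already available. The exponential weighting, the `temperature` parameter, and the mean-one normalization are assumptions chosen for the example, not details taken from the paper.

```python
import math


def token_weights(rewards: list[float], temperature: float = 1.0) -> list[float]:
    """Map per-token rewards to positive weights that average to 1.

    Assumed form: w_t proportional to exp(r_t / temperature).
    """
    exps = [math.exp(r / temperature) for r in rewards]
    mean = sum(exps) / len(exps)
    return [e / mean for e in exps]


def weighted_refusal_loss(token_logps: list[float], rewards: list[float],
                          temperature: float = 1.0) -> float:
    """Per-token negative log-likelihood, reweighted by reward.

    Tokens with higher reward (for example, the first words of a refusal)
    contribute more to the loss and therefore to the gradient.
    """
    if len(token_logps) != len(rewards):
        raise ValueError("token_logps and rewards must have the same length")
    weights = token_weights(rewards, temperature)
    total = sum(w * lp for w, lp in zip(weights, token_logps))
    return -total / len(token_logps)
```

With equal rewards the weights are all 1 and the loss reduces to the ordinary mean negative log-likelihood, so the weighting can be introduced without changing the scale of the refusal term.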

Scalability

Scalability depends on the underlying LLM architecture and the efficiency of the training process.

Regulatory Considerations

Ethical guidelines for AI development and deployment.

Production Readiness

Maturity Level

Research

Time to Market

1-3 years for integration into production LLMs.
