Abstract
Organizations are increasingly adopting and adapting Large Language Models
(LLMs) hosted on public repositories such as HuggingFace. Although these
adaptations often improve performance on specialized downstream tasks, recent
evidence indicates that they can also degrade a model's safety or fairness.
Since different fine-tuning techniques may exert distinct effects on these
critical dimensions, this study undertakes a systematic assessment of their
trade-offs. Four widely used Parameter-Efficient Fine-Tuning (PEFT) methods
(LoRA, IA3, Prompt-Tuning, and P-Tuning) are applied to four instruction-tuned
model families (Meta-Llama-3-8B, Qwen2.5-7B, Mistral-7B, and Gemma-7B). In
total, 235
fine-tuned variants are evaluated across eleven safety hazard categories and
nine demographic fairness dimensions. The results show that adapter-based
approaches (LoRA, IA3) tend to improve safety scores and are the least
disruptive to fairness, retaining more of the base model's accuracy and
exhibiting lower bias scores. In
contrast, prompt-based methods (Prompt-Tuning and P-Tuning) generally reduce
safety and cause larger fairness regressions, with decreased accuracy and
increased bias. Alignment shifts are strongly moderated by base model type:
Llama remains stable, Qwen records modest gains, Gemma experiences the steepest
safety decline, and Mistral, which is released without an internal moderation
layer, displays the greatest variance. Improvements in safety do not
necessarily translate into improvements in fairness, and no single
configuration optimizes all fairness metrics simultaneously, indicating an
inherent trade-off between these objectives. These findings suggest a practical
guideline for safety-critical deployments: begin with a well-aligned base
model, favour adapter-based PEFT, and conduct category-specific audits of both
safety and fairness.
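For concreteness, the sketch below shows how the four PEFT methods named above could be attached to one of the evaluated base models using HuggingFace's peft library. This is a minimal illustration, not the paper's experimental code: the hyperparameters (LoRA rank, number of virtual tokens, and so on) are placeholders, not the study's reported settings.

```python
# Minimal sketch: attaching the four PEFT methods studied in the paper to a
# base model via HuggingFace's `peft` library. All hyperparameters below are
# illustrative placeholders, not the paper's settings.
from transformers import AutoModelForCausalLM
from peft import (
    LoraConfig,
    IA3Config,
    PromptTuningConfig,
    PromptEncoderConfig,  # P-Tuning
    TaskType,
    get_peft_model,
)

peft_configs = {
    "lora": LoraConfig(
        task_type=TaskType.CAUSAL_LM, r=16, lora_alpha=32, lora_dropout=0.05
    ),
    "ia3": IA3Config(task_type=TaskType.CAUSAL_LM),
    "prompt_tuning": PromptTuningConfig(
        task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20
    ),
    "p_tuning": PromptEncoderConfig(
        task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20
    ),
}

# One of the four instruction-tuned families evaluated in the study.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Swap the dictionary key to switch between the four PEFT methods.
model = get_peft_model(base, peft_configs["lora"])
model.print_trainable_parameters()  # adapter weights are a small fraction of the model
```

In a setup like this, only the adapter or soft-prompt parameters are trainable while the base weights stay frozen, which is what makes these methods parameter-efficient and lets one base checkpoint be paired with each of the four methods in turn.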
Authors
Mina Taraghi
Yann Pequignot
Amin Nikanjam
Mohamed Amine Merzouk
Foutse Khomh
Submitted
November 1, 2025
Key Contributions
This study systematically evaluates the trade-offs between efficiency and alignment (safety and fairness) for four popular Parameter-Efficient Fine-Tuning (PEFT) methods applied to four LLM families. It finds that adapter-based methods (LoRA, IA3) best preserve safety and fairness while retaining accuracy, offering practical guidance for responsible LLM adaptation.
Business Value
Provides crucial insights for organizations using or developing LLMs, enabling them to choose fine-tuning methods that balance performance gains with essential safety and fairness requirements, thus reducing risks and building more trustworthy AI applications.