Abstract
Alignment of Large Language Models (LLMs) typically relies on Reinforcement
Learning from Human Feedback (RLHF) with gradient-based optimizers such as
Proximal Policy Optimization (PPO) or Group Relative Policy Optimization
(GRPO). While effective, these methods require complex distributed training,
large memory budgets, and careful hyperparameter tuning, all of which become
increasingly difficult at billion-parameter scale. We present ESSA,
Evolutionary Strategies for Scalable Alignment, a gradient-free framework that
aligns LLMs using only forward inference and black-box optimization. ESSA
focuses optimization on Low-Rank Adapters (LoRA) and further compresses their
parameter space by optimizing only the singular values from an SVD
decomposition of each adapter matrix. This dimensionality reduction makes
evolutionary search practical even for very large models and allows efficient
operation in quantized INT4 and INT8 inference modes. Across reasoning and
instruction-following benchmarks, ESSA improves the test accuracy of
Qwen2.5-Math-7B by 12.6% on GSM8K and by 14.8% on PRM800K, and raises the
accuracy of LLaMA3.1-8B on IFEval by 22.5%, all relative to GRPO. In
large-scale settings ESSA shows stronger scaling than
gradient-based methods: on Qwen2.5-32B for PRM800K it reaches near-optimal
accuracy twice as fast on 16 GPUs and six times as fast on 128 GPUs compared
with GRPO. These results position evolutionary strategies as a compelling,
hardware-friendly alternative to gradient-based LLM alignment, combining
competitive quality with substantially reduced wall-clock time and engineering
overhead.
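
The parameter-space compression is the core idea: each LoRA adapter update is factored by SVD, and only its singular values are exposed to the evolutionary search, shrinking the search space from the full adapter to a rank-sized vector. The sketch below illustrates this with NumPy and a vanilla Gaussian evolutionary strategy; the toy `reward_fn`, the adapter shapes, and the particular ES update rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of ESSA's search space: optimize only the singular
# values of a LoRA adapter's SVD with a black-box evolutionary strategy.
# reward_fn, the shapes, and the ES variant are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(0)

# A toy LoRA adapter: delta_W = B @ A with rank r.
d, r = 64, 8
A = rng.standard_normal((r, d)) * 0.01
B = rng.standard_normal((d, r)) * 0.01

# Factor the adapter once; only the r singular values are optimized.
U, s, Vt = np.linalg.svd(B @ A)
U, Vt = U[:, :r], Vt[:r, :]   # rank-r truncation
s = s[:r]                     # the r-dimensional search vector

def rebuild(sv):
    """Reassemble the adapter from candidate singular values."""
    return U @ np.diag(sv) @ Vt

def reward_fn(delta_W):
    """Stand-in for black-box evaluation, e.g., forward inference of
    the (possibly quantized) model with this adapter merged in."""
    return -np.linalg.norm(delta_W - 0.05)  # toy objective

# Simple Gaussian evolutionary strategy over the singular values.
sigma, pop = 0.02, 16
for step in range(100):
    noise = rng.standard_normal((pop, r))
    scores = np.array([reward_fn(rebuild(s + sigma * n)) for n in noise])
    # Move toward the reward-weighted average perturbation.
    w = (scores - scores.mean()) / (scores.std() + 1e-8)
    s = s + sigma * (w @ noise) / pop

print("optimized singular values:", np.round(s, 3))
```

Because each candidate is scored with forward inference only, the population can be evaluated in parallel across GPUs with no gradient synchronization, which is what gives the method its scaling advantage over gradient-based training.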
Authors (10)
Daria Korotyshova
Boris Shaposhnikov
Alexey Malakhov
Alexey Khokhulin
Nikita Surnachev
Kirill Ovcharenko
+4 more
Key Contributions
Introduces ESSA (Evolutionary Strategies for Scalable Alignment), a gradient-free framework for aligning LLMs using only forward inference and black-box optimization. It restricts the search to the singular values of an SVD decomposition of each LoRA adapter matrix, enabling efficient alignment of very large models even in quantized INT4 and INT8 inference modes.
Business Value
Makes LLM alignment more accessible and cost-effective, allowing smaller teams or organizations to fine-tune large models for specific safety, ethical, or performance requirements, accelerating product development.