Abstract
The Mixture of Experts (MoE) architecture enables the scaling of Large
Language Models (LLMs) to trillions of parameters by activating a sparse subset
of weights for each input, maintaining constant computational cost during
inference. Concurrently, Low-Rank Adaptation (LoRA) has emerged as a dominant
technique for the parameter-efficient fine-tuning of LLMs on specialized tasks. In
this work, we unify these two paradigms into a novel, end-to-end trainable
framework named L-MoE: a Lightweight Mixture of LoRA Experts. L-MoE redefines
MoE experts not as dense feed-forward networks, but as a collection of
task-specialized, low-rank adapters. A lightweight gating network, trained
jointly with the experts, learns to dynamically compose these LoRA adapters by
computing a weighted average of their parameters for each input token. This
composition is fully differentiable, allowing gradients from a standard
auto-regressive language modeling objective to flow back through the entire
architecture, simultaneously refining both the expert adapters and the routing
strategy. This approach yields a highly parameter-efficient MoE model that is
modular by design, supports dynamic skill composition, and is trainable end to
end. We present the formal mathematical framework for L-MoE, detailing
the differentiable routing mechanism and the joint optimization objective,
thereby providing a new path toward building more efficient, scalable, and
specialized language models.
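To make the routing mechanism concrete, the sketch below shows one plausible reading of the abstract in PyTorch: each expert is a low-rank adapter pair (A_i, B_i) attached to a frozen base linear layer, and a lightweight gating network produces per-token softmax weights that mix the experts' low-rank updates. The class and parameter names (`LMoELinear`, `num_experts`, `rank`) are illustrative assumptions, not taken from the paper, and the exact composition rule (gate-weighted sum of per-expert LoRA deltas) is our interpretation of the "weighted average of their parameters" described above.

```python
# Minimal sketch of an L-MoE-style layer (illustrative; not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LMoELinear(nn.Module):
    def __init__(self, in_features, out_features, num_experts=4, rank=8):
        super().__init__()
        # Frozen pretrained projection from the base model.
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        # Low-rank adapter pairs (A_i, B_i), one per expert.
        self.A = nn.Parameter(torch.randn(num_experts, rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, out_features, rank))
        # Lightweight gating network producing per-token expert weights.
        self.gate = nn.Linear(in_features, num_experts)

    def forward(self, x):
        # x: (batch, seq_len, in_features)
        gates = F.softmax(self.gate(x), dim=-1)               # (b, s, E)
        # Per-expert low-rank update B_i (A_i x) for every token.
        low = torch.einsum("erd,bsd->bser", self.A, x)        # (b, s, E, r)
        delta = torch.einsum("eor,bser->bseo", self.B, low)   # (b, s, E, out)
        # Differentiable composition: gate-weighted sum over experts.
        mixed = (gates.unsqueeze(-1) * delta).sum(dim=2)      # (b, s, out)
        return self.base(x) + mixed
```

Because the gate output here is a dense softmax, gradients from a standard language-modeling loss reach every adapter as well as the gate itself, which is what makes the composition end-to-end trainable; a sparse top-k router would be a straightforward variant of this sketch.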
Submitted
October 19, 2025
Key Contributions
Introduces L-MoE, a novel framework that unifies Mixture of Experts (MoE) and Low-Rank Adaptation (LoRA) into an end-to-end trainable architecture for LLMs. L-MoE redefines experts as LoRA adapters, dynamically composed by a lightweight gating network, enabling efficient scaling and specialization.
Business Value
Facilitates the creation of highly efficient and adaptable LLMs, enabling faster fine-tuning for diverse tasks and potentially reducing the computational cost of deploying large models.