Abstract
Large Language Models (LLMs) achieve remarkable reasoning capabilities
through transformer architectures with attention mechanisms. However,
transformers suffer from quadratic time and memory complexity in the multi-head
attention (MHA) module and must cache key-value states during inference, which
severely limits throughput and scalability. High inference throughput is
critical for agentic tasks, long-context reasoning, efficient deployment under
high request loads, and more efficient test-time compute scaling.
State Space Models (SSMs) such as Mamba offer a promising alternative with
linear inference complexity and a constant memory footprint via recurrent
computation with fixed-size hidden states. In this technical report we
introduce the Apriel-H1 family of hybrid LLMs that combine transformer
attention and SSM sequence mixers for efficient reasoning at the 15B-parameter scale.
These models are obtained through incremental distillation from a pretrained
reasoning transformer, Apriel-Nemotron-15B-Thinker, progressively replacing
less critical attention layers with linear Mamba blocks.
We release multiple post-distillation variants of Apriel-H1-15B-Thinker with
different SSM-to-MHA ratios and analyse how reasoning performance degrades as
more Mamba layers replace MHA. Additionally, we release a 30/50 hybrid variant
of Apriel-H1, further fine-tuned on a supervised dataset of reasoning traces,
achieving over 2x higher inference throughput when deployed in the
production-ready vLLM environment, with minimal degradation in reasoning
performance. This shows that distilled hybrid SSM-Transformer architectures can
deliver substantial efficiency gains over their pretrained transformer
equivalents without meaningfully compromising reasoning quality.
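To make the complexity contrast concrete, the sketch below compares one decode step of attention, whose key-value cache and per-token cost grow with sequence length, against one step of a diagonal linear SSM recurrence of the kind Mamba builds on, whose state stays a fixed size. This is a minimal NumPy illustration, not the Apriel-H1 implementation; the dimensions, the single-head/single-channel simplification, and the random parameters are all placeholders.

```python
import numpy as np

def attention_decode_step(q_t, K_cache, V_cache):
    """One decode step of single-head attention.

    The key/value cache holds one row per previous token, so memory and
    per-token cost grow linearly with sequence length (quadratic over a
    full generation)."""
    scores = K_cache @ q_t / np.sqrt(q_t.shape[-1])   # shape (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache                          # shape (d,)

def ssm_decode_step(x_t, h_prev, A, B, C):
    """One decode step of a diagonal linear SSM recurrence for one channel:
    h_t = A * h_{t-1} + B * x_t,  y_t = C . h_t.

    The hidden state has a fixed size, so per-token cost and memory stay
    constant no matter how long the sequence gets."""
    h_t = A * h_prev + B * x_t    # elementwise update, O(d_state)
    y_t = C @ h_t                 # read out a scalar output for this channel
    return y_t, h_t

rng = np.random.default_rng(0)
d, d_state, T = 64, 16, 8

K_cache, V_cache = np.empty((0, d)), np.empty((0, d))   # grows with t
A = rng.uniform(0.9, 0.99, d_state)                     # stable decay factors
B, C = rng.normal(size=d_state), rng.normal(size=d_state)
h = np.zeros(d_state)                                   # fixed-size state

for _ in range(T):
    k, v, q = rng.normal(size=(3, d))
    K_cache, V_cache = np.vstack([K_cache, k]), np.vstack([V_cache, v])
    _ = attention_decode_step(q, K_cache, V_cache)
    _, h = ssm_decode_step(rng.normal(), h, A, B, C)

print(K_cache.shape, h.shape)   # (8, 64) vs (16,): growing cache vs fixed state
```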
Key Contributions
Introduces the Apriel-H1 family of hybrid LLMs that combine the strengths of transformer attention and State Space Models (SSMs) like Mamba. This hybrid approach aims to achieve efficient reasoning with linear inference complexity and constant memory footprint, addressing the quadratic limitations of standard transformers.
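As a rough illustration of the incremental-replacement idea, the sketch below decides which layers of a toy 50-layer stack keep attention and which are swapped for Mamba mixers, given a target number of replacements. The per-layer importance scores, the greedy selection rule (replace the least critical layers first), and the reading of "30/50" as 30 Mamba mixers out of 50 layers are assumptions for illustration only; the paper's actual criterion and distillation schedule may differ.

```python
import random
from dataclasses import dataclass

@dataclass
class LayerPlan:
    index: int
    mixer: str   # "attention" (kept from the teacher) or "mamba" (distilled replacement)

def plan_hybrid(importance, n_replace):
    """Mark the n_replace least-critical layers for replacement with Mamba mixers
    and keep attention everywhere else. The scores and this greedy rule are
    placeholders, not the paper's actual selection criterion."""
    order = sorted(range(len(importance)), key=lambda i: importance[i])
    to_replace = set(order[:n_replace])
    return [LayerPlan(i, "mamba" if i in to_replace else "attention")
            for i in range(len(importance))]

# Toy example: a 50-layer stack with 30 mixers swapped for Mamba, reading the
# "30/50" variant name as 30-of-50 purely for illustration.
rng = random.Random(0)
scores = [rng.random() for _ in range(50)]      # hypothetical per-layer scores
plan = plan_hybrid(scores, n_replace=30)
print(sum(p.mixer == "mamba" for p in plan), "of", len(plan), "layers use Mamba")
```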
Business Value
Enables more cost-effective and scalable deployment of powerful LLMs for demanding applications like agentic tasks and real-time processing, reducing operational costs.