📄 Abstract
Diffusion language models offer parallel token generation and inherent
bidirectionality, promising more efficient and powerful sequence modeling
compared to autoregressive approaches. However, state-of-the-art diffusion
models (e.g., Dream 7B, LLaDA 8B) suffer from slow inference. While they match
the quality of similarly sized autoregressive (AR) models (e.g., Qwen2.5 7B,
Llama3 8B), their iterative denoising requires multiple full-sequence forward
passes, resulting in high computational costs and latency, particularly for
long input prompts and long-context scenarios. Furthermore, parallel token
generation introduces token incoherence, and current sampling heuristics
suffer significant quality drops as the number of denoising steps decreases.
We address these limitations with two training-free techniques. First,
we propose FreeCache, a Key-Value (KV) approximation caching technique that
reuses stable KV projections across denoising steps, effectively reducing the
computational cost of DLM inference. Second, we introduce Guided Diffusion, a
training-free method that uses a lightweight pretrained autoregressive model to
supervise token unmasking, dramatically reducing the total number of denoising
iterations without sacrificing quality. We conduct extensive evaluations on
open-source reasoning benchmarks, and our combined methods deliver an average
of 12.14x end-to-end speedup across various tasks with negligible accuracy
degradation. For the first time, diffusion language models achieve latency
comparable to, and even lower than, that of widely adopted autoregressive
models. Our work paves the way for scaling diffusion language models to a
broader range of applications across domains.
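To make the FreeCache idea from the abstract concrete, here is a minimal, inference-only PyTorch sketch of KV-approximation caching for a single attention layer. The class name `FreeCacheKV`, the projection layout, and the token-change detection rule are illustrative assumptions, not the paper's implementation: K/V projections are computed once, then recomputed only at positions whose tokens changed since the previous denoising step.

```python
import torch
import torch.nn as nn

class FreeCacheKV(nn.Module):
    """Sketch of KV-approximation caching for one attention layer of a DLM.

    Assumption: most positions (the prompt and already-unmasked tokens) have
    stable hidden states across denoising steps, so their K/V projections
    can be reused instead of recomputed at every step.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.cached_k = None     # (seq, d_model) once populated
        self.cached_v = None
        self.prev_tokens = None

    @torch.no_grad()  # inference-only sketch
    def forward(self, tokens: torch.Tensor, hidden: torch.Tensor):
        # tokens: (seq,) current token ids; hidden: (seq, d_model) states.
        if self.prev_tokens is None:
            # First denoising step: project K/V for every position.
            self.cached_k = self.k_proj(hidden)
            self.cached_v = self.v_proj(hidden)
        else:
            # Later steps: recompute only where the token changed
            # (e.g., newly unmasked positions); reuse the cache elsewhere.
            changed = (tokens != self.prev_tokens).nonzero(as_tuple=True)[0]
            if changed.numel() > 0:
                self.cached_k[changed] = self.k_proj(hidden[changed])
                self.cached_v[changed] = self.v_proj(hidden[changed])
        self.prev_tokens = tokens.clone()
        return self.cached_k, self.cached_v
```

In this sketch only the K/V side is approximated; queries would still be computed fresh at each step, so the savings come from skipping the full-sequence K/V recomputation that dominates per-step cost on long prompts.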
Key Contributions
Addresses the slow inference of diffusion language models (DLMs) with two training-free techniques: FreeCache, an efficient KV caching method, and Guided Diffusion, AR-supervised token unmasking (sketched below). Together they accelerate inference, reduce computational cost, and improve token coherence without compromising quality, making DLMs competitive with autoregressive models.
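The Guided Diffusion step can be pictured as follows: the DLM proposes tokens for all masked positions in parallel, and a lightweight AR model decides which proposals to commit. The `ar_model` callable, its output convention, and `accept_threshold` are illustrative assumptions, not values or interfaces from the paper.

```python
import torch

@torch.no_grad()
def guided_unmask_step(dlm_logits, tokens, mask, ar_model, accept_threshold=0.1):
    """One AR-guided unmasking step (illustrative sketch, not the paper's code).

    dlm_logits: (seq, vocab) DLM logits for the current partially masked input.
    tokens:     (seq,) current token ids (masked slots hold a placeholder id).
    mask:       (seq,) bool, True where the token is still masked.
    ar_model:   assumed callable returning (seq, vocab) next-token logits,
                where row i predicts the token at position i + 1.
    """
    proposals = dlm_logits.argmax(dim=-1)             # parallel DLM proposals
    candidate = torch.where(mask, proposals, tokens)  # fill masked slots

    ar_probs = ar_model(candidate).softmax(dim=-1)
    # AR probability of each candidate token given its prefix (shift by one).
    scores = ar_probs[:-1].gather(-1, candidate[1:, None]).squeeze(-1)
    scores = torch.cat([scores.new_ones(1), scores])  # position 0: no prefix

    # Unmask only the proposals the lightweight AR model endorses; the rest
    # stay masked for the next denoising iteration.
    accept = mask & (scores > accept_threshold)
    tokens = torch.where(accept, proposals, tokens)
    return tokens, mask & ~accept
```

Iterating this step until no masked positions remain commits many tokens per iteration while the AR verifier filters incoherent parallel proposals, which is what cuts the total number of denoising steps.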
Business Value
Enables faster and more cost-effective deployment of advanced text generation models, opening the door to real-time applications and wider adoption of DLMs across NLP tasks.