📄 Abstract
The scalability of large language models for long-context reasoning is
severely constrained by the linear growth of their Transformer key-value cache,
which incurs significant memory and computational costs. We posit that as a
model generates reasoning tokens, the informational value of past generated
tokens diminishes, creating an opportunity for compression. In this work, we
propose to periodically compress the generation KV cache with a learned,
special-purpose token and evict compressed entries. We train the model to
perform this compression via a modified joint distillation and reinforcement
learning (RL) framework. Our training method adds minimal overhead to the
conventional RL process, as it reuses the RL outputs for distillation.
Empirically, our method achieves a superior memory-accuracy Pareto frontier
compared to both the model without cache compression and training-free
compression techniques.
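
The compress-and-evict mechanism can be pictured with a minimal sketch. The code below illustrates one plausible reading of the abstract: a learned beacon token summarizes each fixed-size window of generated KV entries, after which the window is evicted and only the beacon's entry remains. The interval `compress_every`, the toy summarizer `toy_attend`, and the class `CompressedKVCache` are illustrative assumptions, not the paper's implementation.

```python
# Sketch of periodic KV-cache compression with a learned "beacon" token,
# assuming a single attention layer/head for clarity. All names and the
# toy summarizer are illustrative assumptions, not the paper's method.
import torch
import torch.nn as nn


class CompressedKVCache(nn.Module):
    def __init__(self, d_model: int, compress_every: int = 64):
        super().__init__()
        self.compress_every = compress_every
        # Learned special-purpose token whose KV entry replaces a window.
        self.beacon_embedding = nn.Parameter(torch.randn(d_model))
        self.keys = []    # list of (d_model,) key tensors
        self.values = []  # list of (d_model,) value tensors
        self.pending = 0  # generation tokens appended since the last beacon

    def append(self, k, v, attend_fn):
        """Add one generated token's KV; compress the window when due."""
        self.keys.append(k)
        self.values.append(v)
        self.pending += 1
        if self.pending >= self.compress_every:
            self._compress(attend_fn)

    def _compress(self, attend_fn):
        # The beacon attends over the pending window; its resulting KV
        # summarizes that window (attend_fn stands in for a transformer
        # forward pass over the beacon token).
        window_k = torch.stack(self.keys[-self.pending:])
        window_v = torch.stack(self.values[-self.pending:])
        beacon_k, beacon_v = attend_fn(self.beacon_embedding, window_k, window_v)
        # Evict the compressed entries; keep only the beacon's KV.
        del self.keys[-self.pending:]
        del self.values[-self.pending:]
        self.keys.append(beacon_k)
        self.values.append(beacon_v)
        self.pending = 0


def toy_attend(beacon_emb, window_k, window_v):
    # Stand-in summarizer: attention-weighted pooling of the window.
    scores = window_k @ beacon_emb / beacon_emb.numel() ** 0.5
    weights = torch.softmax(scores, dim=0)
    pooled = (weights.unsqueeze(-1) * window_v).sum(dim=0)
    return beacon_emb + pooled, pooled  # (beacon key, beacon value)


if __name__ == "__main__":
    d = 16
    cache = CompressedKVCache(d_model=d, compress_every=8)
    for _ in range(100):
        cache.append(torch.randn(d), torch.randn(d), toy_attend)
    # The cache now holds a handful of beacons plus a partial window
    # instead of 100 entries.
    print(len(cache.keys))
```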
Key Contributions
Introduces 'Breadcrumbs Reasoning', a method for memory-efficient long-context reasoning in Transformers that periodically compresses the generation KV cache into learned 'compression beacons' and evicts the compressed entries. Trained via a joint distillation and RL framework, it achieves a superior memory-accuracy Pareto frontier compared to both the uncompressed model and training-free compression techniques. A sketch of the joint objective follows.
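
As one illustration of the joint training objective, the sketch below combines a REINFORCE-style policy-gradient term with a KL distillation term computed on the same RL rollouts, so distillation reuses outputs already produced during RL. The specific losses, the `distill_weight` parameter, and the function `joint_loss` are assumptions for illustration; the paper specifies a modified joint framework whose exact form is not given here.

```python
# Hedged sketch of a joint distillation + RL loss on shared rollouts.
# The teacher is assumed to be the model with an uncompressed cache and
# the student the model with the compressed cache; this pairing and the
# REINFORCE/KL choices are illustrative assumptions.
import torch
import torch.nn.functional as F


def joint_loss(student_logits, teacher_logits, sampled_tokens, rewards,
               distill_weight: float = 0.5):
    """student_logits: (T, V) from the compressed-cache model,
    teacher_logits: (T, V) from the uncompressed model on the same rollout,
    sampled_tokens: (T,) rollout tokens, rewards: (T,) per-token returns."""
    log_probs = F.log_softmax(student_logits, dim=-1)
    token_log_probs = log_probs.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
    # Policy-gradient (REINFORCE-style) term on the rollout reward.
    rl_loss = -(token_log_probs * rewards).mean()
    # Distillation term: match the teacher's next-token distribution
    # on the very same rollout, so no extra sampling is needed.
    distill_loss = F.kl_div(log_probs, F.log_softmax(teacher_logits, dim=-1),
                            log_target=True, reduction="batchmean")
    return rl_loss + distill_weight * distill_loss


if __name__ == "__main__":
    T, V = 12, 50
    loss = joint_loss(torch.randn(T, V), torch.randn(T, V),
                      torch.randint(0, V, (T,)), torch.randn(T))
    print(loss.item())
```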
Business Value
Enables LLMs to handle much longer contexts more efficiently, reducing operational costs and expanding their applicability to tasks requiring extensive background information, such as document analysis, long-form content generation, and complex Q&A.