
Breadcrumbs Reasoning: Memory-Efficient Reasoning with Compression Beacons

Abstract

The scalability of large language models for long-context reasoning is severely constrained by the linear growth of the Transformer key-value (KV) cache, which incurs significant memory and computational costs. We posit that as a model generates reasoning tokens, the informational value of past generated tokens diminishes, creating an opportunity for compression. In this work, we propose to periodically compress the generation KV cache with a learned, special-purpose token and evict the compressed entries. We train the model to perform this compression via a modified joint distillation and reinforcement learning (RL) framework. Our training method minimizes overhead over the conventional RL process, as it leverages RL outputs for distillation. Empirically, our method achieves a superior memory-accuracy Pareto frontier compared to both the model without cache compression and training-free compression techniques.
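To make the mechanism concrete, here is a minimal sketch of the periodic compress-and-evict loop the abstract describes, assuming a Hugging Face-style decoding interface and a tuple-style KV cache. The beacon token id, the compression interval, and the eviction helper are illustrative assumptions, not the paper's actual configuration.

```python
import torch

BEACON_ID = 50_257   # hypothetical id of the learned compression beacon token
INTERVAL = 64        # hypothetical: compress every 64 generated tokens

def generate_with_beacons(model, input_ids, max_new_tokens):
    """Greedy decoding with periodic KV-cache compression via a beacon token."""
    past_kv, generated, ids = None, [], input_ids
    for step in range(max_new_tokens):
        out = model(ids, past_key_values=past_kv, use_cache=True)
        past_kv = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)
        ids = next_id
        # Every INTERVAL tokens, feed the beacon so its KV entry can attend to
        # and summarize the segment, then evict the segment's raw entries.
        if (step + 1) % INTERVAL == 0:
            beacon = torch.full_like(next_id, BEACON_ID)
            out = model(beacon, past_key_values=past_kv, use_cache=True)
            past_kv = evict_segment(out.past_key_values,
                                    segment_len=INTERVAL, keep_last=1)
    return torch.cat(generated, dim=-1)

def evict_segment(past_kv, segment_len, keep_last):
    """Drop the segment's raw KV entries, keeping only the beacon's entry
    (the last `keep_last` positions) as their compressed stand-in.
    A real implementation must also handle position ids after eviction."""
    new_kv = []
    for k, v in past_kv:  # k, v: [batch, heads, seq, head_dim]
        prefix = slice(None, -(segment_len + keep_last))
        beacon = slice(-keep_last, None)
        new_kv.append((torch.cat([k[..., prefix, :], k[..., beacon, :]], dim=-2),
                       torch.cat([v[..., prefix, :], v[..., beacon, :]], dim=-2)))
    return tuple(new_kv)
```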

Key Contributions

Introduces 'Breadcrumbs Reasoning', a method for memory-efficient long-context reasoning in Transformers that periodically compresses the KV cache into learned 'compression beacon' tokens and evicts the original entries. Trained via a joint distillation and RL framework (sketched below), it achieves a superior memory-accuracy Pareto frontier compared to both the uncompressed model and training-free compression techniques.
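The joint training signal can be pictured as follows: a hedged sketch combining a policy-gradient RL term with a KL-based distillation term computed on the same RL rollouts, so distillation adds little overhead beyond the RL loop. The loss weighting, the KL direction, and the use of a frozen full-cache teacher are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def joint_loss(student_logits, teacher_logits, logprobs, advantages, lam=0.5):
    # RL term: policy gradient on rollouts sampled from the (compressed-cache)
    # student, weighted by estimated advantages.
    rl_loss = -(advantages.detach() * logprobs).mean()
    # Distillation term on the same rollout tokens: KL between the frozen
    # teacher (full KV cache) and the student (compressed cache).
    distill_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True, reduction="batchmean",
    )
    return rl_loss + lam * distill_loss
```

Reusing the rollouts already produced for RL means no separate distillation dataset or extra sampling pass is needed, which is how the method keeps its overhead over conventional RL small.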

Business Value

Enables LLMs to handle much longer contexts more efficiently, reducing operational costs and expanding their applicability to tasks requiring extensive background information, such as document analysis, long-form content generation, and complex Q&A.