
Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains

📄 Abstract

Large Language Models (LLMs) achieve superior performance through Chain-of-Thought (CoT) reasoning, but these token-level reasoning chains are computationally expensive and inefficient. In this paper, we introduce Compressed Latent Reasoning (CoLaR), a novel framework that dynamically compresses reasoning processes in latent space through a two-stage training approach. First, during supervised fine-tuning, CoLaR extends beyond next-token prediction by incorporating an auxiliary next compressed embedding prediction objective. This process merges embeddings of consecutive tokens using a compression factor randomly sampled from a predefined range, and trains a specialized latent head to predict distributions of subsequent compressed embeddings. Second, we enhance CoLaR through reinforcement learning (RL) that leverages the latent head's non-deterministic nature to explore diverse reasoning paths and exploit more compact ones. This approach enables CoLaR to: i) perform reasoning at a dense latent level (i.e., silently), substantially reducing reasoning chain length, and ii) dynamically adjust reasoning speed at inference time by simply prompting the desired compression factor. Extensive experiments across four mathematical reasoning datasets demonstrate that CoLaR achieves 14.1% higher accuracy than latent-based baseline methods at comparable compression ratios, and reduces reasoning chain length by 53.3% with only 4.8% performance degradation compared to the explicit CoT method. Moreover, when applied to more challenging mathematical reasoning tasks, our RL-enhanced CoLaR demonstrates performance gains of up to 5.4% while dramatically reducing latent reasoning chain length by 82.8%. The code and models will be released upon acceptance.
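
The compression step described in the abstract is easiest to see in code. The following is a minimal sketch, not the authors' implementation: it assumes the merge operator is a simple mean over each group of consecutive token embeddings (the paper only says embeddings are merged, without fixing the operator here), and the sampling range 1..5 and hidden size 768 are illustrative placeholders.

```python
import torch

def compress_embeddings(token_embs: torch.Tensor, factor: int) -> torch.Tensor:
    """Merge every `factor` consecutive token embeddings into one.

    Assumption: merging is a plain mean over each group.
    token_embs: (seq_len, hidden_dim) embeddings of a reasoning chain.
    Returns: (ceil(seq_len / factor), hidden_dim) compressed embeddings.
    """
    seq_len, hidden = token_embs.shape
    pad = (-seq_len) % factor  # zero-pad so the length divides evenly
    if pad:
        token_embs = torch.cat([token_embs, token_embs.new_zeros(pad, hidden)])
    # Group consecutive embeddings and average within each group.
    return token_embs.view(-1, factor, hidden).mean(dim=1)

# During SFT, a compression factor is sampled per example from a predefined
# range (the range 1..5 below is a placeholder, not the paper's setting).
factor = int(torch.randint(low=1, high=6, size=(1,)))
chain = torch.randn(17, 768)               # dummy reasoning-chain embeddings
targets = compress_embeddings(chain, factor)
print(factor, targets.shape)               # e.g. 3 -> torch.Size([6, 768])
```

Under this reading, the compressed embeddings serve as prediction targets for the latent head during supervised fine-tuning, and prompting a larger factor at inference trades reasoning detail per step for a shorter chain.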
Authors (6)
Wenhui Tan
Jiaze Li
Jianzhong Ju
Zhenbo Luo
Jian Luan
Ruihua Song
Submitted
May 22, 2025
arXiv Category
cs.CL

Key Contributions

Introduces CoLaR, a framework for dynamically compressing LLM reasoning chains in latent space using a two-stage approach (SFT + RL). It trains a latent head to predict compressed embeddings and uses RL to explore more compact reasoning paths, significantly reducing computational cost while maintaining reasoning quality.
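
How the latent head stays non-deterministic enough for RL exploration can be sketched as follows. This is an assumption-laden illustration: the paper states that the head predicts distributions over subsequent compressed embeddings, but the Gaussian parameterization below is a placeholder, not the confirmed design.

```python
import torch
import torch.nn as nn

class LatentHead(nn.Module):
    """Illustrative latent head: maps the LLM's current hidden state to a
    Gaussian over the next compressed embedding (parameterization assumed)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.mu = nn.Linear(hidden_dim, hidden_dim)         # predicted mean
        self.log_sigma = nn.Linear(hidden_dim, hidden_dim)  # predicted log-std

    def forward(self, h: torch.Tensor) -> torch.distributions.Normal:
        return torch.distributions.Normal(self.mu(h), self.log_sigma(h).exp())

head = LatentHead(hidden_dim=768)           # hidden size is a placeholder
h = torch.randn(1, 768)                     # hidden state at the current step
dist = head(h)
z = dist.rsample()                          # sampled next latent reasoning step
log_prob = dist.log_prob(z).sum()           # policy log-prob for an RL objective
```

Sampling z rather than taking the distribution's mean is what lets the RL stage explore diverse latent reasoning paths and then reward the more compact ones.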

Business Value

Reduces the operational costs of deploying LLMs for complex reasoning tasks, making advanced AI capabilities more accessible and scalable for businesses.