📄 Abstract
Large Language Models (LLMs) have made significant strides in problem-solving
by incorporating reasoning processes. However, this enhanced reasoning
capability results in an increased number of output tokens during inference,
leading to higher computational costs. To address this challenge, we propose
TwT (Thinking without Tokens), a method that reduces inference-time costs
through habitual reasoning distillation with multi-teacher guidance, while
maintaining high performance. Our approach introduces a Habitual Reasoning
Distillation method, which internalizes explicit reasoning into the model's
habitual behavior through a Teacher-Guided compression strategy inspired by
human cognition. Additionally, we propose Dual-Criteria Rejection Sampling
(DCRS), a technique that generates a high-quality and diverse distillation
dataset using multiple teacher models, making our method suitable for
unsupervised scenarios. Experimental results demonstrate that TwT effectively
reduces inference costs while preserving superior performance, achieving up to
a 13.6% improvement in accuracy with fewer output tokens compared to other
distillation methods, offering a highly practical solution for efficient LLM
deployment.
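The abstract describes Dual-Criteria Rejection Sampling only at a high level: candidate reasoning traces come from multiple teacher models, and a sample is kept only if it passes both a quality criterion and a diversity criterion. The sketch below is a hypothetical illustration of that idea, not the paper's actual algorithm; the `score_fn` quality judge, the Jaccard-overlap diversity test, and the threshold values are all assumptions chosen for clarity.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two reasoning traces (toy diversity measure)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def dual_criteria_rejection_sampling(prompt, teachers, score_fn,
                                     quality_min=0.7, overlap_max=0.8):
    """Collect one candidate trace per teacher and keep only those that pass
    both criteria: quality (score above quality_min) and diversity
    (token overlap with every already-accepted trace below overlap_max)."""
    accepted = []
    for teacher in teachers:
        trace = teacher(prompt)
        if score_fn(prompt, trace) < quality_min:      # criterion 1: quality
            continue
        if any(jaccard(trace, t) > overlap_max for t in accepted):
            continue                                   # criterion 2: diversity
        accepted.append(trace)
    return accepted

# Toy "teachers": in practice these would be separate LLMs generating
# reasoning traces; here they are stand-in functions.
teachers = [
    lambda p: "2 plus 2 equals 4 so the answer is 4",
    lambda p: "adding 2 and 2 gives 4",
    lambda p: "2 plus 2 equals 4 so the answer is 4",  # near-duplicate of the first
]
score = lambda p, t: 1.0 if "4" in t else 0.0          # stand-in quality judge
dataset = dual_criteria_rejection_sampling("What is 2+2?", teachers, score)
# → two traces survive: the exact duplicate fails the diversity criterion
```

Under this toy setup, the third teacher's trace is rejected because it overlaps completely with an already-accepted trace, which is the kind of redundancy the diversity criterion is meant to filter out of the distillation dataset.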
Key Contributions
TwT introduces 'Habitual Reasoning Distillation' to internalize explicit reasoning into LLM behavior, inspired by human cognition, and uses 'Dual-Criteria Rejection Sampling' with multi-teacher guidance to create distillation datasets. This method significantly reduces inference costs and token usage while preserving high performance, addressing a key bottleneck in LLM deployment.
Business Value
Enables the deployment of powerful LLMs in resource-constrained environments or for applications requiring low latency and cost, such as real-time conversational agents or on-device AI.