
TwT: Thinking without Tokens by Habitual Reasoning Distillation with Multi-Teachers' Guidance

Abstract

Large Language Models (LLMs) have made significant strides in problem-solving by incorporating reasoning processes. However, this enhanced reasoning capability results in an increased number of output tokens during inference, leading to higher computational costs. To address this challenge, we propose TwT (Thinking without Tokens), a method that reduces inference-time costs through habitual reasoning distillation with multi-teachers' guidance, while maintaining high performance. Our approach introduces a Habitual Reasoning Distillation method, which internalizes explicit reasoning into the model's habitual behavior through a Teacher-Guided compression strategy inspired by human cognition. Additionally, we propose Dual-Criteria Rejection Sampling (DCRS), a technique that generates a high-quality and diverse distillation dataset using multiple teacher models, making our method suitable for unsupervised scenarios. Experimental results demonstrate that TwT effectively reduces inference costs while preserving superior performance, achieving up to a 13.6% improvement in accuracy with fewer output tokens compared to other distillation methods, offering a highly practical solution for efficient LLM deployment.
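The abstract does not spell out the two criteria, so the sketch below is only an illustration of how Dual-Criteria Rejection Sampling might operate in the unsupervised setting it describes: a quality criterion based on cross-teacher answer agreement (a proxy for correctness when no labels exist) and a diversity criterion that rejects near-duplicate reasoning traces. The function name `dcrs_filter`, both thresholds, and both criteria are assumptions for illustration, not the paper's definitions.

```python
# Hypothetical sketch of Dual-Criteria Rejection Sampling (DCRS).
# Assumed criteria (not from the paper):
#   1. quality  -> cross-teacher answer agreement (unsupervised proxy)
#   2. diversity -> reject near-duplicate reasoning traces
from collections import Counter
from difflib import SequenceMatcher


def dcrs_filter(candidates, quality_threshold=0.5, diversity_threshold=0.8):
    """Select samples generated for one question by multiple teachers.

    candidates: list of dicts with keys 'question', 'reasoning', 'answer'.
    Returns the accepted subset, possibly empty.
    """
    if not candidates:
        return []

    # Criterion 1 (quality): keep only answers that a sufficient fraction
    # of teachers agree on; otherwise reject the question entirely.
    votes = Counter(c["answer"] for c in candidates)
    majority_answer, count = votes.most_common(1)[0]
    if count / len(candidates) < quality_threshold:
        return []
    pool = [c for c in candidates if c["answer"] == majority_answer]

    # Criterion 2 (diversity): greedily keep reasoning traces that are not
    # too similar to any already-accepted trace.
    accepted = []
    for c in pool:
        if all(SequenceMatcher(None, c["reasoning"], a["reasoning"]).ratio()
               < diversity_threshold for a in accepted):
            accepted.append(c)
    return accepted
```

Agreement-based filtering is one plausible reading of "high-quality ... suitable for unsupervised scenarios"; the paper may well use a different scoring function.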

Key Contributions

TwT introduces Habitual Reasoning Distillation, which internalizes explicit reasoning into the model's habitual behavior via a teacher-guided compression strategy inspired by human cognition, and Dual-Criteria Rejection Sampling (DCRS), which uses multiple teacher models to build a high-quality, diverse distillation dataset without ground-truth labels. Together, these significantly reduce inference-time token usage and cost while preserving performance, addressing a key bottleneck in LLM deployment; a sketch of the staged compression idea follows.
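To make the "habitual" part concrete, here is a minimal, hypothetical sketch of staged reasoning compression: the student is fine-tuned on teacher outputs whose explicit reasoning is progressively shortened until the final stage contains no reasoning tokens at all, so answering directly becomes habitual. `compress_reasoning`, the stage ratios, and the data format are illustrative placeholders, not the paper's actual procedure.

```python
# Illustrative staged-compression sketch (not the paper's implementation).

def compress_reasoning(reasoning: str, keep_ratio: float) -> str:
    """Crude stand-in for teacher-guided compression: keep only the
    first keep_ratio fraction of the reasoning sentences."""
    sentences = [s for s in reasoning.split(". ") if s]
    k = max(0, round(len(sentences) * keep_ratio))
    return ". ".join(sentences[:k])


def build_stage_targets(samples, keep_ratio):
    """Build (prompt, target) pairs for one distillation stage.

    samples: dicts with 'question', 'reasoning', 'answer', e.g. the
    output of a DCRS-style filter over multi-teacher generations.
    """
    pairs = []
    for s in samples:
        hint = compress_reasoning(s["reasoning"], keep_ratio)
        target = (hint + "\n" if hint else "") + s["answer"]
        pairs.append((s["question"], target))
    return pairs


# Toy data standing in for a DCRS-filtered distillation set.
dcrs_samples = [{
    "question": "What is 17 * 6?",
    "reasoning": "17 * 6 = 17 * 5 + 17. 17 * 5 = 85. 85 + 17 = 102",
    "answer": "102",
}]

# Staged schedule: explicit reasoning is gradually removed from the
# targets until the student is trained to answer with no reasoning
# tokens at all (keep_ratio = 0.0).
for keep_ratio in (1.0, 0.5, 0.0):
    stage_data = build_stage_targets(dcrs_samples, keep_ratio)
    # fine_tune(student_model, stage_data)  # standard SFT at each stage
```

The staged schedule mirrors how a habit forms: the scaffolding (explicit reasoning) is withdrawn gradually rather than all at once, which is one plausible way to read "teacher-guided compression."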

Business Value

Enables the deployment of powerful LLMs in resource-constrained environments or in applications requiring low latency and low cost, such as real-time conversational agents or on-device AI.