Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in
chain-of-thought (CoT) reasoning. However, the current LLM reasoning paradigm
initiates thinking only after the entire input is available, which introduces
unnecessary latency and weakens attention to earlier information in dynamic
scenarios. Inspired by the human cognitive pattern of thinking while reading,
we first design a streaming thinking paradigm for LLMs, where
reasoning unfolds in the order of input and further adjusts its depth once
reading is complete. We instantiate this paradigm with
StreamingThinker, a framework that enables LLMs to think while reading
through the integration of streaming CoT generation, streaming-constraint
training, and streaming parallel inference. Specifically, StreamingThinker
employs streaming reasoning units with quality control for CoT generation,
enforces order-preserving reasoning through streaming attention masks and
position encoding, and leverages parallel KV caches that decouple input
encoding from reasoning generation, thereby ensuring alignment and enabling
true concurrency. We evaluate StreamingThinker on the Qwen3 model family across
math reasoning, logical reasoning, and context-based QA reasoning tasks.
Experimental results show that StreamingThinker preserves performance
comparable to batch thinking while reducing token waiting before the onset of
reasoning by 80% and time-level latency for producing the final answer by more
than 60%, demonstrating the effectiveness of the streaming paradigm for LLM
reasoning. Code will be released at
https://github.com/EIT-NLP/StreamingLLM/tree/main/StreamingThinker.
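
To make the order-preserving constraint concrete, below is a minimal sketch of how a streaming attention mask of the kind described in the abstract might be built. This is an illustration under assumptions of my own (an interleaved layout of input chunks and reasoning units, with earlier segments fully visible), not the authors' implementation; the function name and segment encoding are hypothetical.

```python
# Minimal sketch of an order-preserving "streaming" attention mask.
# Assumption (not from the paper): the sequence interleaves input chunks and
# reasoning units as [in_0, think_0, in_1, think_1, ...], and each reasoning
# unit may attend only to segments that arrived before it, plus itself causally.
import torch

def streaming_attention_mask(segments):
    """segments: list of (kind, length), kind in {"input", "think"}.
    Returns a [T, T] boolean mask where True means attention is allowed."""
    T = sum(n for _, n in segments)
    mask = torch.zeros(T, T, dtype=torch.bool)
    bounds, pos = [], 0
    for _, n in segments:
        bounds.append((pos, pos + n))
        pos += n
    for qs, qe in bounds:
        for ks, ke in bounds:
            if ke <= qs:    # earlier segment: fully visible to the query segment
                mask[qs:qe, ks:ke] = True
            elif ks == qs:  # same segment: standard causal mask within it
                mask[qs:qe, ks:ke] = torch.tril(
                    torch.ones(qe - qs, qe - qs, dtype=torch.bool))
    return mask

# Example: two 4-token input chunks, each followed by a 3-token reasoning unit.
m = streaming_attention_mask([("input", 4), ("think", 3), ("input", 4), ("think", 3)])
# think_0 (rows 4-6) cannot attend to in_1 (cols 7-10): reasoning stays in input order.
assert not m[4:7, 7:11].any()
```

Under this layout, each reasoning unit can only see input that has already arrived, which is the property the paper's streaming attention masks and position encoding are meant to enforce.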
Authors (5)
Junlong Tong
Yingqi Fan
Anhao Zhao
Yunpu Ma
Xiaoyu Shen
Submitted
October 20, 2025
Key Contributions
This paper introduces the 'streaming thinking' paradigm for LLMs: inspired by how humans think while reading, it lets the model reason concurrently with input processing. The proposed framework, StreamingThinker, integrates streaming CoT generation, streaming-constraint training, and parallel inference, enabling reasoning to unfold in input order and adjust its depth once the input completes, thereby reducing latency and strengthening attention to early information.
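
As a rough, self-contained illustration of the parallel-inference idea, the toy below keeps two separate caches and runs input encoding and reasoning generation concurrently, so reasoning over chunk k can proceed while chunk k+1 is still arriving. Every name here (encode_stream, reason_stream, the list-based caches) is a hypothetical stand-in for real prefill/decode calls over transformer KV caches; this is not the paper's code.

```python
# Toy sketch of decoupling input encoding from reasoning generation.
# Assumption (mine): list-based caches stand in for the per-stream KV caches.
import asyncio

input_cache, reason_cache = [], []  # stand-ins for input/reasoning KV caches

async def encode_stream(chunks):
    """'Prefill' side: append each arriving input chunk to the input cache."""
    for chunk in chunks:
        await asyncio.sleep(0.01)       # simulated arrival latency
        input_cache.append(chunk)       # hypothetical model.prefill(chunk)

async def reason_stream(total_chunks):
    """'Decode' side: emit one reasoning unit per input chunk already encoded."""
    done = 0
    while done < total_chunks:
        if len(input_cache) > done:     # new input available -> reason over it
            unit = f"<think about: {input_cache[done]}>"
            reason_cache.append(unit)   # hypothetical model.decode(...) step
            print(unit)
            done += 1
        else:
            await asyncio.sleep(0.001)  # yield until more input arrives

async def main():
    chunks = ["premise 1", "premise 2", "question"]
    # Encoding and reasoning run concurrently instead of prefill-then-decode.
    await asyncio.gather(encode_stream(chunks), reason_stream(len(chunks)))

asyncio.run(main())
```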
Business Value
Enables LLMs to be used in time-sensitive applications where immediate reasoning is crucial, such as real-time control systems, interactive agents, and dynamic data analysis, leading to more responsive and effective AI solutions.