Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in
chain-of-thought (CoT) reasoning. However, the current LLM reasoning paradigm
initiates thinking only after the entire input is available, which introduces
unnecessary latency and weakens attention to earlier information in dynamic
scenarios. Inspired by the human cognitive pattern of thinking while reading,
we first design a streaming thinking paradigm for LLMs, where
reasoning unfolds in the order of input and further adjusts its depth once
reading is complete. We instantiate this paradigm with
StreamingThinker, a framework that enables LLMs to think while reading
through the integration of streaming CoT generation, streaming-constraint
training, and streaming parallel inference. Specifically, StreamingThinker
employs streaming reasoning units with quality control for CoT generation,
enforces order-preserving reasoning through streaming attention masks and
position encoding, and leverages parallel KV caches that decouple input
encoding from reasoning generation, thereby ensuring alignment and enabling
true concurrency. We evaluate StreamingThinker on the Qwen3 model family across
math reasoning, logical reasoning, and context-based QA reasoning tasks.
Experimental results show that StreamingThinker preserves performance
comparable to batch thinking while reducing token waiting before the onset of
reasoning by 80% and time-level latency for producing the final answer by more
than 60%, demonstrating the effectiveness of the streaming paradigm for LLM
reasoning. Code will be released at
https://github.com/EIT-NLP/StreamingLLM/tree/main/StreamingThinker.
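
To make the order-preserving constraint concrete, below is a minimal sketch of how a streaming attention mask of the kind described in the abstract might be built. This is an illustration under assumptions of my own (an interleaved layout of input chunks and reasoning units, with earlier segments fully visible), not the authors' implementation; the function name and segment encoding are hypothetical.

```python
# Minimal sketch of an order-preserving "streaming" attention mask.
# Assumption (not from the paper): the sequence interleaves input chunks and
# reasoning units as [in_0, think_0, in_1, think_1, ...], and each reasoning
# unit may attend only to segments that arrived before it, plus itself causally.
import torch

def streaming_attention_mask(segments):
    """segments: list of (kind, length), kind in {"input", "think"}.
    Returns a [T, T] boolean mask where True means attention is allowed."""
    T = sum(n for _, n in segments)
    mask = torch.zeros(T, T, dtype=torch.bool)
    bounds, pos = [], 0
    for _, n in segments:
        bounds.append((pos, pos + n))
        pos += n
    for qs, qe in bounds:
        for ks, ke in bounds:
            if ke <= qs:    # earlier segment: fully visible to the query segment
                mask[qs:qe, ks:ke] = True
            elif ks == qs:  # same segment: standard causal mask within it
                mask[qs:qe, ks:ke] = torch.tril(
                    torch.ones(qe - qs, qe - qs, dtype=torch.bool))
    return mask

# Example: two 4-token input chunks, each followed by a 3-token reasoning unit.
m = streaming_attention_mask([("input", 4), ("think", 3), ("input", 4), ("think", 3)])
# think_0 (rows 4-6) cannot attend to in_1 (cols 7-10): reasoning stays in input order.
assert not m[4:7, 7:11].any()
```

Under this layout, each reasoning unit can only see input that has already arrived, which is the property the paper's streaming attention masks and position encoding are meant to enforce.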
Authors (5)
Junlong Tong
Yingqi Fan
Anhao Zhao
Yunpu Ma
Xiaoyu Shen
Submitted
October 20, 2025
Key Contributions
This paper introduces the 'streaming thinking' paradigm for LLMs: inspired by how humans think while reading, it lets the model reason concurrently with input processing. The proposed framework, StreamingThinker, integrates streaming CoT generation, streaming-constraint training, and parallel inference, enabling reasoning to unfold in input order and adjust its depth once the input completes, thereby reducing latency and strengthening attention to early information.
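
As a rough, self-contained illustration of the parallel-inference idea, the toy below keeps two separate caches and runs input encoding and reasoning generation concurrently, so reasoning over chunk k can proceed while chunk k+1 is still arriving. Every name here (encode_stream, reason_stream, the list-based caches) is a hypothetical stand-in for real prefill/decode calls over transformer KV caches; this is not the paper's code.

```python
# Toy sketch of decoupling input encoding from reasoning generation.
# Assumption (mine): list-based caches stand in for the per-stream KV caches.
import asyncio

input_cache, reason_cache = [], []  # stand-ins for input/reasoning KV caches

async def encode_stream(chunks):
    """'Prefill' side: append each arriving input chunk to the input cache."""
    for chunk in chunks:
        await asyncio.sleep(0.01)       # simulated arrival latency
        input_cache.append(chunk)       # hypothetical model.prefill(chunk)

async def reason_stream(total_chunks):
    """'Decode' side: emit one reasoning unit per input chunk already encoded."""
    done = 0
    while done < total_chunks:
        if len(input_cache) > done:     # new input available -> reason over it
            unit = f"<think about: {input_cache[done]}>"
            reason_cache.append(unit)   # hypothetical model.decode(...) step
            print(unit)
            done += 1
        else:
            await asyncio.sleep(0.001)  # yield until more input arrives

async def main():
    chunks = ["premise 1", "premise 2", "question"]
    # Encoding and reasoning run concurrently instead of prefill-then-decode.
    await asyncio.gather(encode_stream(chunks), reason_stream(len(chunks)))

asyncio.run(main())
```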
Business Value
Enables LLMs to be used in time-sensitive applications where immediate reasoning is crucial, such as real-time control systems, interactive agents, and dynamic data analysis, leading to more responsive and effective AI solutions.