
Improving LLM Reasoning via Dependency-Aware Query Decomposition and Logic-Parallel Content Expansion


Abstract

The integration of Large Language Models (LLMs) into real-time Web applications, such as AI-powered search and conversational agents, presents a fundamental Web infrastructure challenge: reconciling the demand for high-quality, complex reasoning with the stringent low-latency and high-throughput requirements of interactive services. Current LLM reasoning, hindered by computationally inefficient sequential generation and rigid reasoning strategies, creates a critical bottleneck for Web services. Existing approaches typically optimize LLM reasoning for either efficiency or quality but struggle to achieve both, and thus fail to meet the dual requirements of modern Web platforms. To overcome these limitations, we propose Orion, a novel and efficient reasoning framework that enables dependency-aware query decomposition and logic-parallel content expansion. Concretely, Orion decomposes the reasoning process for a single query into two synergistic phases: (1) key point generation, which distills logically structured key points through retrieval-augmented few-shot prompting, and (2) content parallel expansion, which concurrently elaborates on these points based on a dependency graph to ensure logical consistency. Furthermore, Orion introduces a pipeline scheduling mechanism that exploits the complementary computational characteristics of the two phases (generation stresses GPU compute, while expansion stresses GPU memory) across multiple queries, enabling cross-query parallelism and dramatically improving reasoning performance (i.e., efficiency and quality). Experiments on diverse benchmarks show that Orion not only delivers up to 4.33x higher token generation speed and 3.42x lower answer latency over the baselines but also improves reasoning quality by up to 18.75% through explicitly modeling inter-point dependencies.
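
The content parallel expansion phase described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`expand_point`, `expand_all`), the example key points, and the placeholder that stands in for an LLM call are all assumptions made for illustration. The sketch only shows the scheduling idea, namely that key points whose prerequisites are already expanded can be elaborated concurrently, while dependent points wait for their predecessors.

```python
import asyncio
from graphlib import TopologicalSorter


async def expand_point(point: str, context: dict[str, str]) -> str:
    """Hypothetical stand-in for an LLM call that elaborates one key point."""
    await asyncio.sleep(0)  # placeholder for real inference latency
    used = ", ".join(sorted(context)) or "none"
    return f"{point} (expanded using: {used})"


async def expand_all(deps: dict[str, set[str]]) -> dict[str, str]:
    """Expand key points in waves; each wave holds points whose prerequisites are done."""
    sorter = TopologicalSorter(deps)  # maps each point to the points it depends on
    sorter.prepare()                  # raises CycleError if the graph has a cycle
    results: dict[str, str] = {}
    while sorter.is_active():
        ready = list(sorter.get_ready())
        # All points in the current wave are independent, so expand them concurrently.
        texts = await asyncio.gather(
            *(expand_point(p, {d: results[d] for d in deps.get(p, ())}) for p in ready)
        )
        for point, text in zip(ready, texts):
            results[point] = text
            sorter.done(point)  # unblocks points that depend on this one
    return results


# Assumed example: "effects" needs "causes"; "summary" needs "definition" and "effects".
key_points = {
    "definition": set(),
    "causes": set(),
    "effects": {"causes"},
    "summary": {"definition", "effects"},
}
expanded = asyncio.run(expand_all(key_points))
```

In this example, "definition" and "causes" form the first wave and run together, "effects" follows, and "summary" runs last. A real system would replace `expand_point` with batched model inference; how Orion actually builds the dependency graph and schedules expansion is specified in the paper, not here.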

Authors (4)

Xianjun Gao

Jianchun Liu

Hongli Xu

Liusheng Huang

Submitted

October 28, 2025

arXiv Category

cs.AI


Key Contributions

This paper proposes 'Orion,' a novel reasoning framework designed to overcome the latency and throughput bottlenecks of LLMs in real-time web applications. Orion employs dependency-aware query decomposition and logic-parallel content expansion, enabling high-quality, complex reasoning while meeting stringent performance requirements. This approach addresses the trade-off between reasoning efficiency and quality that limits current LLM applications on the web.

Business Value

Enables the development of more responsive and sophisticated AI-powered web services, improving user experience and enabling new applications in search, chatbots, and real-time analysis.

Paper Metadata

Innovation Type

Efficient LLM Reasoning Framework

Deployment Feasibility

High. Designed specifically for web infrastructure integration.

Limitations Addressed

Addresses the limitations of current LLM reasoning strategies that are computationally inefficient (sequential generation) and rigid, creating a bottleneck for real-time web services. It overcomes the challenge of achieving both high quality and high efficiency in LLM reasoning.

Performance Gains

Reports up to 4.33x higher token generation speed, 3.42x lower answer latency, and up to 18.75% better reasoning quality than the baselines, whereas previous methods optimized for either efficiency or quality.

Technical Tags

Large Language Models, LLMs, Web Applications, Low-Latency, High-Throughput, Reasoning Efficiency, Query Decomposition, Logic-Parallel Content Expansion, Orion framework, Real-time AI

Research Topics

LLM Efficiency, Web Infrastructure, Real-time AI Systems, Distributed Computing, AI Reasoning

Methods & Architectures

Dependency-aware query decomposition, Logic-parallel content expansion, Orion framework, Large Language Models (LLMs)

Applications & Tasks

Web Services; Search Engines; Conversational Agents; Real-time AI Applications

LLM reasoning bottleneck in real-time applications; Balancing reasoning quality and efficiency; Inefficient sequential generation; Rigid reasoning strategies

Enabling high-quality, complex LLM reasoning with low latency; Improving the efficiency of LLM reasoning for web services; Decomposing complex queries and expanding content in parallel

Related Fields

Artificial Intelligence, Machine Learning, Natural Language Processing, Web Development, Distributed Systems, Performance Optimization

Keywords

LLMs, Reasoning, Efficiency, Latency, Web Applications, Query Decomposition, Parallel Processing, Real-time AI, Orion Framework, NLP

Academic Context

#LLM Efficiency, #Web Infrastructure, #Real-time AI Systems, #Distributed Computing, #AI Reasoning

Technology Stack

Frameworks & Libraries

LLMs

Commercial Potential

Potential Products

High-performance AI search engines, Real-time conversational AI platforms, AI-powered web service components

Target Industries

Technology, E-commerce, Customer Service, Information Retrieval

Use Case Examples

Providing instant, complex answers in a search engine; Powering highly responsive AI assistants; Real-time analysis of user queries on a website

Competitive Edge

Offers a novel approach to LLM reasoning that explicitly addresses the dual requirements of quality and efficiency for real-time web applications, outperforming methods that compromise on one aspect.

Market Opportunity

Large and growing market for AI-powered web services and real-time applications.

Revenue Models

Licensing of the Orion framework; integration services for web companies.

Resource Requirements

Compute Needs

Requires efficient inference infrastructure for LLMs, optimized for parallel processing.

Data Requirements

Requires diverse queries and associated knowledge bases for decomposition and expansion.

Deployment Constraints

Integration complexity with existing web architectures; need for efficient LLM serving infrastructure.

Scalability

Designed for high throughput and scalability in web environments.

Production Readiness

Maturity Level

Research

Time to Market

1-2 years for integration into web platforms.
