
Accelerating Mobile Language Model Generation via Hybrid Context and Hardware Coordination

Abstract

Enhancing on-device large language models (LLMs) with contextual information from local data enables personalized and task-aware generation, powering use cases such as intelligent assistants and UI agents. While recent developments in neural processors have substantially improved the efficiency of prefill on mobile devices, the token-by-token generation process still suffers from high latency and limited hardware utilization due to its inherently memory-bound characteristics. This work presents sd.npu, a mobile inference framework that integrates speculative decoding with dynamic hardware scheduling to accelerate context-aware text generation on mobile devices. The framework introduces three synergistic components: (1) adaptive execution scheduling, which dynamically balances compute graphs between prefill and decoding phases; (2) context-aligned drafting, which improves speculative efficiency through lightweight online calibration to current tasks; and (3) hardware-efficient draft extension, which reuses and expands intermediate sequences to improve processing parallelism and reduce verification cost. Experiments on multiple smartphones and representative workloads show consistent improvements of up to 3.8x in generation speed and 4.7x in energy efficiency compared with existing mobile inference solutions. Component-level analysis further validates the contribution of each optimization.
Authors (6)
Zhiyang Chen
Daliang Xu
Haiyang Shen
Mengwei Xu
Shangguang Wang
Yun Ma
Submitted
October 17, 2025
arXiv Category
cs.CL

Key Contributions

This work introduces sd.npu, a mobile inference framework that accelerates context-aware text generation on mobile devices by integrating speculative decoding with dynamic hardware scheduling. It features adaptive execution scheduling, context-aligned drafting for improved speculative efficiency, and hardware-efficient draft extension, together overcoming the memory-bound nature of token-by-token generation.
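The core mechanism the framework builds on, speculative decoding, can be illustrated with a minimal sketch: a cheap draft model proposes several tokens autoregressively, and the full target model then verifies them, committing the longest matching prefix plus one corrected token. The sketch below uses toy stand-in functions (`draft_model`, `target_model` are hypothetical, not the paper's models) and runs verification sequentially for clarity; the paper's contribution is in scheduling the drafting and batched verification efficiently on mobile NPUs, which is not shown here.

```python
import random

# Toy stand-ins for the two LLMs (hypothetical, for illustration only).
def target_model(context):
    # Deterministic stand-in for the target model's greedy next token.
    return (sum(context) + len(context)) % 50

def draft_model(context):
    # Cheap model: usually agrees with the target, sometimes deviates.
    return target_model(context) if random.random() < 0.8 else 0

def speculative_decode(prompt, num_tokens, draft_len=4):
    """Greedy speculative decoding: draft `draft_len` tokens cheaply,
    then verify them against the target model."""
    seq = list(prompt)
    while len(seq) - len(prompt) < num_tokens:
        # 1) Drafting: autoregressively propose draft_len tokens.
        draft, ctx = [], list(seq)
        for _ in range(draft_len):
            tok = draft_model(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2) Verification: on real hardware all draft positions are
        #    scored in one parallel pass; here we loop for clarity.
        for i, tok in enumerate(draft):
            expected = target_model(seq + draft[:i])
            if tok != expected:
                # First mismatch: keep the verified prefix and
                # substitute the target model's own token.
                seq.extend(draft[:i])
                seq.append(expected)
                break
        else:
            seq.extend(draft)  # every drafted token was accepted
    return seq[len(prompt):][:num_tokens]
```

With greedy verification, the output is token-for-token identical to decoding with the target model alone; the speedup comes from the target model checking several positions per invocation instead of one.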

Business Value

Enables richer, more responsive AI experiences directly on user devices, improving privacy, reducing reliance on cloud connectivity, and powering new categories of mobile applications.