📄 Abstract
Enhancing on-device large language models (LLMs) with contextual information
from local data enables personalized and task-aware generation, powering use
cases such as intelligent assistants and UI agents. While recent developments
in neural processors have substantially improved the efficiency of prefill on
mobile devices, the token-by-token generation process still suffers from high
latency and limited hardware utilization due to its inherently memory-bound
characteristics. This work presents sd.npu, a mobile inference framework that
integrates speculative decoding with dynamic hardware scheduling to accelerate
context-aware text generation on mobile devices. The framework introduces three
synergistic components: (1) adaptive execution scheduling, which dynamically
balances compute graphs between prefill and decoding phases; (2)
context-aligned drafting, which improves speculative-decoding efficiency through
lightweight online calibration to the current task; and (3) hardware-efficient
draft extension, which reuses and expands intermediate sequences to improve
processing parallelism and reduce verification cost. Experiments on multiple
smartphones and representative workloads show consistent improvements of up to
3.8x in generation speed and 4.7x in energy efficiency compared with existing
mobile inference solutions. Component-level analysis further validates the
contribution of each optimization.
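The speculative decoding that sd.npu builds on can be illustrated with a minimal draft-and-verify loop. The sketch below is a generic greedy variant, not the paper's implementation: the toy `draft_model` and `target_model` functions are hypothetical stand-ins over integer "tokens", and the paper's NPU scheduling, context-aligned drafting, and draft extension are not modeled.

```python
# Minimal sketch of the draft-and-verify loop behind speculative decoding.
# Assumption: greedy acceptance (keep the longest prefix on which the cheap
# draft model agrees with the expensive target model).

def speculative_decode(draft_model, target_model, prompt, k=4, max_new=16):
    """Draft model proposes k tokens; target model verifies them and keeps
    the longest agreeing prefix, falling back to its own token on mismatch."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1) Draft phase: cheap model proposes k tokens autoregressively.
        draft, ctx = [], seq[:]
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify phase: target model checks each drafted position
        #    (in a real system all k positions are scored in one pass,
        #    which is what makes verification hardware-efficient).
        for i in range(k):
            expected = target_model(seq + draft[:i])
            if expected != draft[i]:
                # Reject here; take the target model's own token instead.
                seq.append(expected)
                break
            seq.append(draft[i])
        else:
            # All k drafts accepted; target contributes one bonus token.
            seq.append(target_model(seq))
    return seq[:len(prompt) + max_new]

# Hypothetical toy models: the target repeats last token + 1; the draft
# agrees except every third step, so some drafts get rejected.
def target_model(ctx):
    return (ctx[-1] + 1) % 100

def draft_model(ctx):
    t = target_model(ctx)
    return t if len(ctx) % 3 else (t + 7) % 100
```

Because rejected drafts are replaced by the target model's own token, the output matches plain greedy decoding with the target model; the speedup comes from the draft model's cheap proposals amortizing the target model's memory-bound forward passes.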
Authors (6)
Zhiyang Chen
Daliang Xu
Haiyang Shen
Mengwei Xu
Shangguang Wang
Yun Ma
Submitted
October 17, 2025
Key Contributions
This work introduces sd.npu, a mobile inference framework that accelerates context-aware text generation on mobile devices by integrating speculative decoding with dynamic hardware scheduling. It combines adaptive execution scheduling, context-aligned drafting for improved speculative efficiency, and hardware-efficient draft extension to overcome the memory-bound nature of token-by-token generation.
Business Value
Enables richer, more responsive AI experiences directly on user devices, improving privacy, reducing reliance on cloud connectivity, and powering new categories of mobile applications.