📄 Abstract
Structural pruning improves hardware-agnostic inference efficiency for large language models (LLMs), yet it often struggles to maintain performance comparable to the dense model. Local pruning performs efficient layer-by-layer compression but ignores the global topology. Global pruning aims to identify an optimal sparse model, but intuitive methods typically adopt a two-stage paradigm that first evaluates substructure saliency and then applies global pruning, which ignores inter-structure dependencies and fails to achieve end-to-end optimization. To address these limitations, we propose Týr-the-Pruner, an efficient end-to-end search-based global structural pruning framework. The framework constructs a supernet by repeatedly applying local pruning across a range of sparsity ratios to each layer of an LLM, with the goal of determining the optimal sparsity distribution under a target overall sparsity ratio. Concretely, we introduce an effective local pruning method and an expectation error accumulation approach to improve supernet construction. Furthermore, we employ an iterative prune-and-search strategy with coarse-to-fine sparsity granularity to ensure efficient search convergence. Experimental results show that Týr-the-Pruner achieves state-of-the-art structural pruning, retaining 97% of the dense model's performance while removing a challenging 50% of Llama-3.1-70B's parameters. Code will be available at https://github.com/AMD-AGI/Tyr-the-Pruner.
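For intuition, the minimal Python sketch below illustrates the kind of per-layer sparsity-distribution search the abstract describes: each layer draws its sparsity from a candidate grid that is refined coarse-to-fine, and the search keeps distributions whose parameter-weighted average matches the target overall sparsity while reducing a proxy loss. The layer count, the `proxy_loss` function, and the mutation-based search loop are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): search a per-layer sparsity
# distribution that meets a target overall sparsity, using a toy
# random-mutation search over coarse-to-fine candidate ratio grids.
import random

NUM_LAYERS = 8
PARAMS_PER_LAYER = [1.0] * NUM_LAYERS   # assumption: equal parameter count per layer
TARGET_SPARSITY = 0.5                   # target overall sparsity ratio


def overall_sparsity(ratios):
    """Parameter-weighted average sparsity across layers."""
    total = sum(PARAMS_PER_LAYER)
    return sum(r * p for r, p in zip(ratios, PARAMS_PER_LAYER)) / total


def proxy_loss(ratios):
    """Hypothetical stand-in for evaluating a pruned subnet drawn from the
    supernet (e.g., calibration-set perplexity). Here: a toy quadratic that
    penalizes pruning later layers more heavily."""
    return sum(r * r * (i + 1) for i, r in enumerate(ratios))


def mutate(ratios, candidates):
    """Re-draw one layer's sparsity from the candidate grid and compensate on
    another layer so the overall sparsity stays (approximately) on target."""
    new = list(ratios)
    i, j = random.sample(range(NUM_LAYERS), 2)
    new[i] = random.choice(candidates)
    need = TARGET_SPARSITY * sum(PARAMS_PER_LAYER) - sum(
        r * p for k, (r, p) in enumerate(zip(new, PARAMS_PER_LAYER)) if k != j)
    new[j] = min(max(need / PARAMS_PER_LAYER[j], 0.0), 1.0)
    return new


def search(granularities=(0.25, 0.125, 0.0625), steps=200):
    """Coarse-to-fine: start from a uniform distribution, then refine with
    progressively finer candidate sparsity grids."""
    best = [TARGET_SPARSITY] * NUM_LAYERS
    for step_size in granularities:
        candidates = [k * step_size for k in range(int(1 / step_size) + 1)]
        for _ in range(steps):
            cand = mutate(best, candidates)
            if (abs(overall_sparsity(cand) - TARGET_SPARSITY) < 1e-3
                    and proxy_loss(cand) < proxy_loss(best)):
                best = cand
    return best


if __name__ == "__main__":
    random.seed(0)
    ratios = search()
    print("per-layer sparsity:", [round(r, 3) for r in ratios])
    print("overall sparsity:", round(overall_sparsity(ratios), 3))
```

In the actual framework, evaluating a candidate corresponds to assembling the matching pruned layers from the supernet rather than calling a toy loss, but the constraint structure (per-layer ratios averaging to the target overall sparsity) is the same.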
Authors (7)
Guanchen Li
Yixing Xu
Zeping Li
Ji Liu
Xuanwu Yin
Dong Li
Key Contributions
Týr-the-Pruner is an efficient end-to-end, search-based global structural pruning framework for LLMs. It constructs a supernet from layer-wise pruned candidates at multiple sparsity ratios and searches for the per-layer sparsity distribution that meets a target overall sparsity, addressing the limitations of local pruning and two-stage global pruning methods.
Business Value
Significantly reduces the computational cost and memory footprint of LLMs, making them deployable on a wider range of devices, including mobile and edge hardware, and lowering operational costs.