
Týr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization

Abstract

Structural pruning enhances hardware-agnostic inference efficiency for large language models (LLMs) yet often fails to maintain comparable performance. Local pruning performs efficient layer-by-layer compression but ignores global topology. Although global pruning aims to identify an optimal sparse model, intuitive methods typically adopt a two-stage paradigm that first evaluates substructure saliency and then applies global pruning, which ignores inter-structure dependencies and fails to achieve end-to-end optimization. To address these limitations, we propose Týr-the-Pruner, an efficient end-to-end search-based global structural pruning framework. This framework constructs a supernet by repeatedly applying local pruning across a range of sparsity ratios to each layer in an LLM, with the core goal of determining the optimal sparsity distribution under a target overall sparsity ratio. Concretely, we introduce an effective local pruning and an expectation error accumulation approach to improve supernet construction. Furthermore, we employ an iterative prune-and-search strategy with coarse-to-fine sparsity granularity to ensure efficient search convergence. Experimental results show that Týr-the-Pruner achieves state-of-the-art structural pruning, retaining 97% of the dense model's performance while removing a challenging 50% of Llama-3.1-70B's parameters. Code will be available at https://github.com/AMD-AGI/Tyr-the-Pruner.
Authors (7)
Guanchen Li
Yixing Xu
Zeping Li
Ji Liu
Xuanwu Yin
Dong Li
+1 more
Submitted
March 12, 2025
arXiv Category
cs.LG
arXiv PDF

Key Contributions

Týr-the-Pruner proposes an efficient end-to-end search-based global structural pruning framework for LLMs. It constructs a supernet by applying local pruning to each layer at a range of sparsity ratios, then searches over that supernet for the per-layer sparsity distribution that meets a target overall sparsity, addressing the limitations of purely local pruning and of two-stage global pruning methods.
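To make the search idea concrete, here is a minimal, hypothetical sketch of the core selection problem: given per-layer candidates (each a sparsity ratio paired with an estimated pruning error), pick one ratio per layer so the overall sparsity hits the target while the accumulated error is minimized. This is an illustrative toy with made-up error numbers and a brute-force search over equal-sized layers; the paper's actual supernet construction, expectation error accumulation, and coarse-to-fine iterative search are far more elaborate.

```python
import itertools

def search_distribution(candidates, target, tol=0.05):
    """Exhaustively pick one sparsity ratio per layer so that the average
    sparsity is within `tol` of `target`, minimizing the summed error.

    candidates: list (one entry per layer) of {sparsity_ratio: expected_error}
    Assumes equal-sized layers, so overall sparsity is the plain average.
    """
    best_combo, best_err = None, float("inf")
    for combo in itertools.product(*(c.items() for c in candidates)):
        ratios = [r for r, _ in combo]
        err = sum(e for _, e in combo)           # accumulated expected error
        overall = sum(ratios) / len(ratios)      # overall sparsity ratio
        if abs(overall - target) <= tol and err < best_err:
            best_combo, best_err = ratios, err
    return best_combo, best_err

# Toy example: 3 layers, each offering 30/50/70% sparsity with invented errors.
cands = [
    {0.3: 0.1, 0.5: 0.4, 0.7: 0.9},
    {0.3: 0.2, 0.5: 0.3, 0.7: 0.5},
    {0.3: 0.1, 0.5: 0.2, 0.7: 0.8},
]
dist, err = search_distribution(cands, target=0.5)
# dist assigns low sparsity to error-sensitive layers and high sparsity
# to tolerant ones, while still averaging to the 50% target.
```

Brute force is exponential in the number of layers, which is exactly why the paper resorts to an iterative prune-and-search strategy with coarse-to-fine sparsity granularity rather than enumerating all distributions.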

Business Value

Significantly reduces the computational cost and memory footprint of LLMs, making them deployable on a wider range of devices, including mobile and edge hardware, and lowering operational costs.