📄 Abstract
Structural pruning improves hardware-agnostic inference efficiency for large language models (LLMs), yet it often struggles to maintain performance comparable to the dense model. Local pruning performs efficient layer-by-layer compression but ignores the global topology. Global pruning aims to identify an optimal sparse model, but intuitive methods typically adopt a two-stage paradigm that first evaluates substructure saliency and then applies global pruning, which ignores inter-structure dependencies and fails to achieve end-to-end optimization. To address these limitations, we propose Týr-the-Pruner, an efficient end-to-end search-based global structural pruning framework. The framework constructs a supernet by repeatedly applying local pruning across a range of sparsity ratios to each layer of an LLM, with the goal of determining the optimal sparsity distribution under a target overall sparsity ratio. Concretely, we introduce an effective local pruning method and an expectation error accumulation approach to improve supernet construction. Furthermore, we employ an iterative prune-and-search strategy with coarse-to-fine sparsity granularity to ensure efficient search convergence. Experimental results show that Týr-the-Pruner achieves state-of-the-art structural pruning, retaining 97% of the dense model's performance while removing a challenging 50% of Llama-3.1-70B's parameters. Code will be available at https://github.com/AMD-AGI/Tyr-the-Pruner.
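For intuition, the minimal Python sketch below illustrates the kind of per-layer sparsity-distribution search the abstract describes: each layer draws its sparsity from a candidate grid that is refined coarse-to-fine, and the search keeps distributions whose parameter-weighted average matches the target overall sparsity while reducing a proxy loss. The layer count, the `proxy_loss` function, and the mutation-based search loop are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): search a per-layer sparsity
# distribution that meets a target overall sparsity, using a toy
# random-mutation search over coarse-to-fine candidate ratio grids.
import random

NUM_LAYERS = 8
PARAMS_PER_LAYER = [1.0] * NUM_LAYERS   # assumption: equal parameter count per layer
TARGET_SPARSITY = 0.5                   # target overall sparsity ratio


def overall_sparsity(ratios):
    """Parameter-weighted average sparsity across layers."""
    total = sum(PARAMS_PER_LAYER)
    return sum(r * p for r, p in zip(ratios, PARAMS_PER_LAYER)) / total


def proxy_loss(ratios):
    """Hypothetical stand-in for evaluating a pruned subnet drawn from the
    supernet (e.g., calibration-set perplexity). Here: a toy quadratic that
    penalizes pruning later layers more heavily."""
    return sum(r * r * (i + 1) for i, r in enumerate(ratios))


def mutate(ratios, candidates):
    """Re-draw one layer's sparsity from the candidate grid and compensate on
    another layer so the overall sparsity stays (approximately) on target."""
    new = list(ratios)
    i, j = random.sample(range(NUM_LAYERS), 2)
    new[i] = random.choice(candidates)
    need = TARGET_SPARSITY * sum(PARAMS_PER_LAYER) - sum(
        r * p for k, (r, p) in enumerate(zip(new, PARAMS_PER_LAYER)) if k != j)
    new[j] = min(max(need / PARAMS_PER_LAYER[j], 0.0), 1.0)
    return new


def search(granularities=(0.25, 0.125, 0.0625), steps=200):
    """Coarse-to-fine: start from a uniform distribution, then refine with
    progressively finer candidate sparsity grids."""
    best = [TARGET_SPARSITY] * NUM_LAYERS
    for step_size in granularities:
        candidates = [k * step_size for k in range(int(1 / step_size) + 1)]
        for _ in range(steps):
            cand = mutate(best, candidates)
            if (abs(overall_sparsity(cand) - TARGET_SPARSITY) < 1e-3
                    and proxy_loss(cand) < proxy_loss(best)):
                best = cand
    return best


if __name__ == "__main__":
    random.seed(0)
    ratios = search()
    print("per-layer sparsity:", [round(r, 3) for r in ratios])
    print("overall sparsity:", round(overall_sparsity(ratios), 3))
```

In the actual framework, evaluating a candidate corresponds to assembling the matching pruned layers from the supernet rather than calling a toy loss, but the constraint structure (per-layer ratios averaging to the target overall sparsity) is the same.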
Authors (7)
Guanchen Li
Yixing Xu
Zeping Li
Ji Liu
Xuanwu Yin
Dong Li
Key Contributions
Týr-the-Pruner is an efficient end-to-end, search-based global structural pruning framework for LLMs. It constructs a supernet from layer-wise pruned candidates at multiple sparsity ratios and searches for the per-layer sparsity distribution that meets a target overall sparsity, addressing the limitations of local pruning and two-stage global pruning methods.
Business Value
Significantly reduces the computational cost and memory footprint of LLMs, making them deployable on a wider range of devices, including mobile and edge hardware, and lowering operational costs.