📄 Abstract
Pruning is critical for scaling large language models (LLMs). Global pruning
achieves strong performance but requires $\mathcal{O}(N)$ memory, which is
infeasible for billion-parameter models. Local pruning reduces GPU memory usage
to that of a single layer by pruning layers independently, but it neglects
inter-layer dependencies and often leads to suboptimal performance in
high-sparsity regimes. Unlike unstructured pruning, structured pruning produces
regular sparsity patterns that align well with GPU kernels and library
optimizations, making it more hardware-efficient. However, structured pruning
typically relies on global pruning, since structured patterns are more prone to
severe performance degradation under local optimization. To jointly achieve
structured pruning and the memory efficiency of local pruning, we propose a
divide-and-conquer strategy that decomposes the global pruning problem into
coordinated subproblems across different modules, each of which fits within
limited GPU memory. Building on this idea, we design \textbf{STRUPRUNE}, an
ADMM-based framework that integrates structured sparsity into the pruning
process, combining the memory efficiency of local pruning with the hardware
compatibility of structured methods. We derive a closed-form analytical
solution for structured pruning masks that provides an explicit rule for
layer-wise sparsity allocation, and further develop an energy-based asymptotic
framework yielding a softmax-form allocation scheme that simplifies
optimization while adapting to heterogeneous layer importance. Experiments
demonstrate that STRUPRUNE matches the perplexity of global structured pruning
while reducing memory cost from $\mathcal{O}(N)$ to $\mathcal{O}(\sqrt{N})$,
enabling practical deployment at the billion-parameter scale.
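As a concrete illustration of the softmax-form allocation scheme mentioned above, here is a minimal Python sketch. The layer energy scores, the temperature, and the budget renormalization step are assumptions for illustration, not the paper's exact energy-based derivation.

```python
import numpy as np

def softmax_sparsity_allocation(layer_energies, global_sparsity, temperature=1.0):
    """Softmax-form layer-wise sparsity allocation (illustrative sketch).

    Higher-energy (more important) layers receive lower sparsity. The
    energy scores, temperature, and renormalization are assumptions,
    not the paper's exact formulation.
    """
    e = np.asarray(layer_energies, dtype=float)
    w = np.exp(-e / temperature)          # softmax over negative energies
    w /= w.sum()
    # Scale so the mean sparsity equals the global budget (before clipping).
    s = global_sparsity * len(e) * w
    return np.clip(s, 0.0, 1.0)

# Example: three layers, the first most important, 50% global sparsity.
print(softmax_sparsity_allocation([2.0, 1.0, 0.5], 0.5))
```

Because the weights sum to one, the average sparsity across layers matches the global budget, while more important (higher-energy) layers are pruned less aggressively.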
Key Contributions
Proposes STRUPRUNE, a structured global pruning method that reduces GPU memory usage to $\mathcal{O}(\sqrt{N})$ by decomposing the global pruning problem into coordinated per-module subproblems. This divide-and-conquer strategy achieves structured sparsity and memory efficiency jointly, overcoming the $\mathcal{O}(N)$ memory bottleneck of traditional global pruning and the performance limitations of local pruning; a sketch of the per-module ADMM step follows below.
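To make the memory claim concrete: if the model's $N$ parameters are pruned one module at a time, peak GPU memory scales with the largest module, and choosing modules of roughly $\sqrt{N}$ parameters yields the $\mathcal{O}(\sqrt{N})$ bound. The Python sketch below shows how a single module might be pruned with an ADMM splitting; the value of $\rho$, the iteration count, the least-squares objective, and the L2 column scoring are illustrative assumptions, not STRUPRUNE's exact updates.

```python
import torch

def admm_structured_prune_module(W, X, target_sparsity, rho=1e-2, steps=50):
    """ADMM sketch for structured (column) pruning of a single module.

    Splits the variable as W = Z and alternates:
      W-update : ridge-like least squares keeping X @ W close to X @ W0
      Z-update : projection onto matrices with k zeroed columns
      U-update : scaled dual ascent
    rho, steps, and L2 column scoring are illustrative assumptions.
    """
    W0 = W.clone()
    Z = W.clone()
    U = torch.zeros_like(W)                      # scaled dual variable
    XtX = X.T @ X                                # (d_in, d_in) Gram matrix
    I = torch.eye(W.shape[0], device=W.device, dtype=W.dtype)
    k = int(target_sparsity * W.shape[1])        # number of columns to zero
    for _ in range(steps):
        # W-update: (XtX + rho/2 * I) W = XtX @ W0 + rho/2 * (Z - U)
        rhs = XtX @ W0 + 0.5 * rho * (Z - U)
        W = torch.linalg.solve(XtX + 0.5 * rho * I, rhs)
        # Z-update: zero the k columns of (W + U) with the smallest L2 norm
        V = W + U
        drop = torch.topk(V.norm(dim=0), k, largest=False).indices
        Z = V.clone()
        Z[:, drop] = 0.0
        U = U + W - Z                            # dual update
    return Z                                     # weights with zeroed columns

# Example: prune 50% of the columns of a random 64x128 layer.
X = torch.randn(256, 64)                         # calibration activations
W = torch.randn(64, 128)
W_pruned = admm_structured_prune_module(W, X, target_sparsity=0.5)
```

The W = Z splitting is what keeps the mask projection closed-form while the data-fitting step stays a linear solve, mirroring the closed-form structured mask solution the abstract describes.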
Business Value
Enables the efficient pruning and deployment of extremely large language models, reducing the GPU memory required for the pruning process itself and the hardware cost of serving the resulting models.