Abstract
As Large Language Models (LLMs) are increasingly deployed for narrow tasks in
resource-constrained settings, a central question arises: how much of an LLM is
truly necessary for a given task? We present LLM-Sieve, a framework that prunes
LLMs down to the minimal parameter subset needed to preserve task performance.
Our approach introduces two innovations: (i) output-aligned non-orthogonal
projections, which yield more faithful low-rank approximations than traditional
PCA/SVD by aligning directly with layer outputs; and (ii) adaptive pruning via
a Genetic Algorithm, which automatically discovers matrix-specific pruning
levels and exposes the uneven distribution of task-relevant knowledge. Across
models from 3.8B to 70B parameters, LLM-Sieve removes 20-75% of weights with
only 1-5% accuracy loss, substantially ahead of prior pruning methods. Beyond
efficiency, our framework reveals bottleneck matrices that concentrate critical
knowledge, suggesting architectural implications for future LLM design.
LLM-Sieve integrates seamlessly with LoRA fine-tuning and quantization,
enabling both efficient deployment and deeper understanding of knowledge
organization in LLMs.
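To make the first innovation concrete, the sketch below illustrates one way an output-aligned low-rank factorization can differ from plain PCA/SVD on the weights: the factors are fit so that the layer's outputs on task calibration data are reconstructed well, rather than the weight matrix itself. The function name, the basis choice, and the least-squares fitting procedure are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def output_aligned_lowrank(W, X, rank):
    """Sketch of an output-aligned low-rank factorization (assumed procedure).

    W: weight matrix (d_out x d_in); X: calibration activations (d_in x n).
    Returns factors A (d_out x rank) and B (rank x d_in) chosen to reconstruct
    the layer outputs Y = W @ X on task data, rather than W itself as plain
    SVD/PCA would do.
    """
    Y = W @ X                                    # calibration outputs (d_out x n)
    # Dominant directions of the *outputs* on task data serve as the basis.
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    A = U[:, :rank]                              # d_out x rank
    # Least-squares fit of the second factor against the outputs:
    # minimize ||Y - A @ B @ X||_F over B, aligning the approximation with
    # the outputs instead of orthogonally projecting W alone.
    Z = np.linalg.lstsq(A, Y, rcond=None)[0]     # rank x n
    B = Z @ np.linalg.pinv(X)                    # rank x d_in
    return A, B

# Tiny usage example with random stand-in data.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))
X = rng.normal(size=(128, 256))                  # calibration activations
A, B = output_aligned_lowrank(W, X, rank=16)
err = np.linalg.norm(W @ X - A @ B @ X) / np.linalg.norm(W @ X)
print(f"relative output reconstruction error: {err:.3f}")
```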
Key Contributions
Introduces LLM-Sieve, a framework for task-specific LLM pruning that removes up to 75% of weights with minimal accuracy loss (1-5%). It uses output-aligned projections and a Genetic Algorithm for adaptive pruning, revealing bottleneck matrices that concentrate critical knowledge.
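The second innovation, adaptive pruning via a Genetic Algorithm, can be pictured as a search over per-matrix pruning levels under a compression budget. The sketch below is a minimal, self-contained illustration of that idea; the candidate encoding, the placeholder fitness function, and all hyperparameters are assumptions for illustration, not the paper's configuration (in practice the fitness would come from evaluating the pruned model on task data).

```python
import random

N_MATRICES = 8                          # number of prunable matrices (assumption)
LEVELS = [0.0, 0.25, 0.5, 0.75]         # candidate pruning ratios per matrix

def evaluate_task_accuracy(ratios):
    # Placeholder for running the pruned model on a task validation set.
    # Here we arbitrarily treat the first two matrices as "bottlenecks"
    # whose pruning hurts accuracy more.
    penalty = sum(r * (3.0 if i < 2 else 1.0) for i, r in enumerate(ratios))
    return 1.0 - 0.02 * penalty

def fitness(ratios, min_mean_pruning=0.4):
    acc = evaluate_task_accuracy(ratios)
    mean_pruned = sum(ratios) / len(ratios)
    # Heavily penalize candidates that miss the compression budget.
    return acc if mean_pruned >= min_mean_pruning else acc - 1.0

def mutate(ratios, p=0.2):
    return [random.choice(LEVELS) if random.random() < p else r for r in ratios]

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def genetic_search(pop_size=20, generations=30):
    pop = [[random.choice(LEVELS) for _ in range(N_MATRICES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return max(pop, key=fitness)

best = genetic_search()
print("per-matrix pruning ratios:", best)
```

In such a setup, the matrices that the search consistently refuses to prune are exactly the "bottleneck" matrices the paper reports as concentrating task-critical knowledge.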
Business Value
Enables efficient deployment of powerful LLMs on edge devices and reduces inference costs, broadening the range of industries and applications where LLMs are practical.