Abstract
Quantization is the key method for reducing the inference latency, power, and memory footprint of generative AI models. However, accuracy often degrades sharply when activations are quantized below eight bits. Recent work suggests that invertible linear transformations (e.g., rotations) can aid quantization by reparameterizing feature channels and weights. In this paper, we propose Sequence Transformation and Mixed Precision (STaMP) quantization, a novel strategy that applies linear transformations along the sequence dimension to exploit the strong local correlation in language and visual data. By keeping a small number of tokens in each intermediate activation at higher precision, we maintain model accuracy at lower average activation bit-widths. We evaluate STaMP on recent LVM and LLM architectures, demonstrating that it significantly improves low bit-width activation quantization and complements established activation and weight quantization methods, including recent feature transformations.
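To make the two ideas in the abstract concrete, below is a minimal, illustrative sketch (not the authors' implementation): it applies an invertible linear transformation along the sequence axis of an activation matrix, quantizes most transformed tokens at a low bit-width while keeping a few at higher precision, and then inverts the transform. The choice of a random orthogonal transform and the largest-norm token-selection rule are assumptions made here for illustration only.

    import numpy as np

    def quantize_per_token(x, bits):
        """Symmetric uniform quantization applied per token (per row) of x."""
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
        scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
        return np.round(x / scale).clip(-qmax, qmax) * scale

    def sequence_mixed_precision_quant(x, low_bits=4, high_bits=8, n_high=4):
        """Toy STaMP-style pipeline (assumed details): transform along the
        sequence dimension, keep n_high tokens at higher precision,
        quantize the rest at low precision, then invert the transform."""
        seq_len, _ = x.shape
        # Random orthogonal matrix acting on the sequence dimension (assumption).
        rng = np.random.default_rng(0)
        q, _ = np.linalg.qr(rng.standard_normal((seq_len, seq_len)))
        x_t = q @ x  # mixes information across tokens
        # Keep the n_high largest-norm transformed tokens at higher precision.
        high_idx = np.argsort(np.linalg.norm(x_t, axis=-1))[-n_high:]
        x_q = quantize_per_token(x_t, low_bits)
        x_q[high_idx] = quantize_per_token(x_t[high_idx], high_bits)
        return q.T @ x_q  # invert the (orthogonal) transform

    if __name__ == "__main__":
        acts = np.random.default_rng(1).standard_normal((128, 64))
        err_plain = np.abs(acts - quantize_per_token(acts, 4)).mean()
        err_mixed = np.abs(acts - sequence_mixed_precision_quant(acts, 4, 8, n_high=8)).mean()
        print(f"4-bit per-token error: {err_plain:.4f}  mixed-precision error: {err_mixed:.4f}")

The average bit-width here is a weighted mix of the low and high precisions (e.g., 8 of 128 tokens at 8 bits and the rest at 4 bits gives an average of 4.25 bits), which is what the abstract refers to as a lower average activation bit-width.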
Authors (5)
Marco Federici
Riccardo Del Chiaro
Boris van Breugel
Paul Whatmough
Markus Nagel
Submitted
October 30, 2025
Key Contributions
Proposes STaMP quantization, a novel strategy that applies linear transformations along the sequence dimension for low-precision activation quantization. It maintains accuracy at lower bit-widths by keeping a small number of tokens at higher precision, significantly improving efficiency for LLMs and LVMs.
Business Value
Enables the deployment of large generative AI models on resource-constrained devices and reduces operational costs for cloud-based inference, making advanced AI more accessible and efficient.