Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference

Abstract

Large language models (LLMs) now support context windows of hundreds of thousands to millions of tokens, enabling applications such as long-document summarization, large-scale code synthesis, multi-document question answering and persistent multi-turn dialogue. However, such extended contexts exacerbate the quadratic cost of self-attention, leading to severe latency in autoregressive decoding. Existing sparse attention methods alleviate these costs but rely on heuristic patterns that struggle to recall critical key-value (KV) pairs for each query, resulting in accuracy degradation. We introduce Adamas, a lightweight yet highly accurate sparse attention mechanism designed for long-context inference. Adamas applies the Hadamard transform, bucketization and 2-bit compression to produce compact representations, and leverages Manhattan-distance estimation for efficient top-k selections. Experiments show that Adamas matches the accuracy of full attention with only a 64-token budget, achieves near-lossless performance at 128, and supports up to 8x higher sparsity than prior state-of-the-art (SOTA) methods while delivering up to 4.4x self-attention and 1.5x end-to-end speedups on 32K-length sequences. Remarkably, Adamas attains comparable or even lower perplexity than full attention, underscoring its effectiveness in maintaining accuracy under aggressive sparsity.
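The selection pipeline named in the abstract (Hadamard transform, bucketization into 2-bit codes, Manhattan-distance estimation, top-k selection) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the bucket thresholds `(-1.0, 0.0, 1.0)`, and the tie-breaking rule are all assumptions made for illustration, and a real system would run on GPU tensors and cache the key codes instead of recomputing them per query.

```python
import heapq


def fwht(vec):
    """Unnormalized fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly step: combine entries that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def bucketize(vec, thresholds=(-1.0, 0.0, 1.0)):
    """Map each value to a 2-bit code in 0..3.

    The thresholds are hypothetical; the source does not state how buckets are chosen.
    """
    return [sum(x > t for t in thresholds) for x in vec]


def manhattan(a, b):
    """L1 distance between two equal-length code vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def select_top_k(query, keys, k):
    """Return indices of the k keys whose codes are closest to the query's codes."""
    q_code = bucketize(fwht(query))
    dists = [manhattan(q_code, bucketize(fwht(key))) for key in keys]
    # Smaller distance means a better candidate; ties are broken by lower index (assumption).
    return heapq.nsmallest(k, range(len(keys)), key=lambda i: (dists[i], i))
```

For example, `select_top_k([1, 1, 1, 1], [[1, 1, 1, 1], [-1, -1, -1, -1], [1, -1, 1, -1]], 2)` returns `[0, 1]`: the first key has identical codes (distance 0), the second differs only in the first coordinate (distance 3), and the third is farthest (distance 4). Full attention would then be computed only over the selected keys.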

Authors (7)

Siyuan Yan

Guo-Qing Jiang

Yuchen Zhang

Xiaoxing Ma

Ran Zhu

Chun Cao

+1 more

Submitted

October 21, 2025

arXiv Category

cs.CL

Key Contributions

Adamas is a lightweight, accurate sparse attention mechanism for efficient long-context LLM inference. It applies the Hadamard transform, bucketization, and 2-bit compression to produce compact representations, and uses Manhattan-distance estimation for efficient top-k selection. With a budget of only 64 tokens it matches the accuracy of full attention at significantly lower computational cost.
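To make the "2-bit compression" step concrete, here is a small sketch of packing 2-bit codes four to a byte. The layout (low bits first) and the helper names `pack_2bit` and `unpack_2bit` are assumptions for illustration; the source does not describe the actual storage format.

```python
def pack_2bit(codes):
    """Pack integer codes in 0..3 into bytes, four codes per byte, low bits first."""
    if any(c < 0 or c > 3 for c in codes):
        raise ValueError("each code must fit in 2 bits (0..3)")
    packed = bytearray((len(codes) + 3) // 4)
    for i, c in enumerate(codes):
        packed[i // 4] |= c << (2 * (i % 4))
    return bytes(packed)


def unpack_2bit(packed, count):
    """Recover the first `count` 2-bit codes from a packed byte string."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(count)]
```

Under this layout, a key vector with a hypothetical head dimension of 128 occupies 32 bytes of codes, compared with 256 bytes for the same vector stored in 16-bit floats.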

Business Value

Enables faster and more cost-effective deployment of LLMs for tasks requiring long context, such as summarizing lengthy documents, analyzing large codebases, or maintaining extended dialogues, making these applications more practical and accessible.

Paper Metadata

Innovation Type

Algorithmic Improvement

Deployment Feasibility

High. Designed as a lightweight mechanism to be integrated into existing LLM architectures.

Limitations Addressed

Quadratic computational and memory cost of self-attention with long contexts; accuracy degradation in existing sparse attention methods; severe latency during autoregressive decoding for long sequences

Performance Gains

Matches full-attention accuracy with a 64-token budget; up to 4.4x self-attention and 1.5x end-to-end speedups on 32K-length sequences

Technical Tags

Sparse Attention, Long Context, LLM Inference, Hadamard Transform, 2-bit Compression, Manhattan Distance, Efficiency, Latency Reduction, Autoregressive Decoding, KV Cache Optimization

Research Topics

Efficient Deep Learning, Transformer Architectures, Natural Language Processing, Machine Learning Optimization, Large Language Models

Methods & Architectures

Hadamard Transform, Bucketization, 2-bit Compression, Manhattan-distance Estimation, Top-k Selection, Sparse Attention Mechanism (Adamas), Transformer-based LLMs

Applications & Tasks

Natural Language Processing, Large-Scale Text Generation, Information Processing; Quadratic Complexity of Self-Attention, High Latency in Long-Context Inference, Accuracy Degradation in Sparse Attention, Memory Constraints; Efficient Long-Context Inference, Reducing Attention Computation Cost, Enabling Longer Context Windows

Related Fields

Computer Architecture, Algorithm Design, Information Theory, Deep Learning Optimization

Keywords

Sparse Attention, Long Context, LLM Inference, Efficiency, Latency, Hadamard Transform, Transformer, NLP, Deep Learning, Optimization, KV Cache

Academic Context

Efficient Deep Learning, Transformer Architectures, Natural Language Processing, Machine Learning Optimization, Large Language Models

Commercial Potential

Potential Products

Efficient LLM inference engines; libraries for sparse attention mechanisms; optimized LLM deployment solutions

Target Industries

Technology, Software Development, Cloud Computing, Research

Use Case Examples

Summarizing entire books or research papers; generating large code modules; maintaining coherent, long-running chatbot conversations

Competitive Edge

Provides a more accurate and efficient alternative to existing sparse attention methods, directly tackling the latency and accuracy trade-offs that limit current long-context LLM applications.

Market Opportunity

Rapid growth in LLM applications requiring long context understanding.

Revenue Models

Licensing of the attention mechanism; integration into commercial LLM products.

Resource Requirements

Compute Needs

Reduced compared to full attention for long contexts.

Data Requirements

Standard NLP datasets for evaluating LLM performance.

Deployment Constraints

Integration into existing LLM frameworks might require engineering effort.

Scalability

Designed for scalability to handle extremely long contexts efficiently.

Production Readiness

Maturity Level

Research

Time to Market

1-2 years

Licensing

Likely research/non-commercial.

Patent Potential

Moderate
