Abstract
Current transformer accelerators primarily focus on optimizing self-attention
due to its quadratic complexity. However, this focus is less relevant for
vision transformers with short token lengths, where the Feed-Forward Network
(FFN) tends to be the dominant computational bottleneck. This paper presents a
low-power Vision Transformer accelerator, optimized through algorithm-hardware
co-design. The model complexity is reduced using hardware-friendly dynamic
token pruning without introducing complex mechanisms. Sparsity is further
improved by replacing GELU with ReLU activations and employing dynamic FFN2
pruning, achieving a 61.5% reduction in operations and a 59.3% reduction in
FFN2 weights, with an accuracy loss of less than 2%. The hardware adopts a
row-wise dataflow with output-oriented data access to eliminate data
transposition, and supports dynamic operations with minimal area overhead.
Implemented in TSMC's 28nm CMOS technology, our design occupies 496.4K gates
and includes a 232KB SRAM buffer, achieving a peak throughput of 1024 GOPS at
1 GHz, with an energy efficiency of 2.31 TOPS/W and an area efficiency of
858.61 GOPS/mm².
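To clarify why the GELU-to-ReLU swap enables dynamic FFN2 pruning, the sketch below illustrates the general principle: ReLU produces exact zeros in the FFN hidden activation, so the corresponding rows of the second FFN weight matrix contribute nothing and can be skipped per token. This is a minimal NumPy illustration with assumed toy shapes and names (d_model, d_ff, W1, W2 are placeholders), not the accelerator's actual dataflow or pruning logic.

```python
# Minimal sketch (assumed shapes, not the paper's implementation) of how
# ReLU-induced sparsity lets the FFN2 matmul skip zeroed weight rows.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 64, 256, 16          # toy dimensions, illustrative only

x  = rng.standard_normal((n_tokens, d_model))
W1 = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))

# FFN1 + ReLU: unlike GELU, ReLU yields exact zeros in the hidden activation.
h = np.maximum(x @ W1, 0.0)

# Dense FFN2 as a reference result.
y_dense = h @ W2

# Dynamic FFN2 pruning idea: per token, only the channels surviving ReLU
# contribute to the FFN2 matmul, so the matching rows of W2 can be skipped.
y_sparse = np.zeros_like(y_dense)
for t in range(n_tokens):
    active = np.nonzero(h[t])[0]               # surviving hidden channels
    y_sparse[t] = h[t, active] @ W2[active]    # skip zeroed rows of W2

assert np.allclose(y_dense, y_sparse)
print("avg W2 rows skipped per token:",
      d_ff - np.count_nonzero(h, axis=1).mean())
```

In hardware, the same observation means the FFN2 weight fetches and multiply-accumulates for zero channels can be gated off, which is where the reported operation and weight reductions come from.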
Authors (2)
Ching-Lin Hsiung
Tian-Sheuan Chang
Submitted
October 16, 2025
Key Contributions
This paper presents a novel low-power Vision Transformer accelerator optimized through algorithm-hardware co-design. It introduces hardware-friendly dynamic token pruning and sparsity improvements (replacing GELU with ReLU, dynamic FFN2 pruning) to significantly reduce operations and weights with minimal accuracy loss. The hardware design features a row-wise dataflow for efficient data access.
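As a rough illustration of dynamic token pruning in general (this page does not describe the paper's specific scoring mechanism), the sketch below keeps a fixed fraction of tokens ranked by a simple magnitude score; the function name, the L2-norm score, and the keep ratio are all assumptions chosen to keep the control logic simple, in the spirit of a hardware-friendly scheme.

```python
# Illustrative sketch only: a generic top-k token pruning step using an
# assumed per-token L2-norm importance score and a fixed keep ratio.
import numpy as np

def prune_tokens(tokens: np.ndarray, keep_ratio: float = 0.7) -> np.ndarray:
    """Keep the top-k tokens by a simple magnitude score (assumed criterion)."""
    n_tokens = tokens.shape[0]
    k = max(1, int(n_tokens * keep_ratio))
    scores = np.linalg.norm(tokens, axis=1)    # one importance score per token
    keep = np.sort(np.argsort(scores)[-k:])    # top-k, kept in original order
    return tokens[keep]

x = np.random.default_rng(0).standard_normal((16, 64))   # toy token matrix
print(prune_tokens(x).shape)                              # (11, 64) at keep_ratio=0.7
```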
Business Value
Enables the deployment of powerful Vision Transformer models on power-constrained devices like mobile phones and edge sensors, opening up new possibilities for on-device AI and real-time vision processing.