Abstract
Transformer-based architectures achieve state-of-the-art performance across a
wide range of tasks in natural language processing, computer vision, and speech
processing. However, their immense capacity often leads to overfitting,
especially when training data is limited or noisy. This work proposes
AttentionDrop, a unified family of stochastic regularization techniques
comprising three variants that operate directly on the self-attention
distributions. Hard Attention Masking randomly zeroes out top-k attention
logits per query to encourage diverse context utilization; Blurred
Attention Smoothing applies a dynamic Gaussian convolution over attention
logits to diffuse overly peaked distributions; and Consistency-Regularized
AttentionDrop enforces output stability under multiple independent
AttentionDrop perturbations via a KL-based consistency loss. Experimental
results demonstrate that AttentionDrop consistently improves accuracy,
calibration, and adversarial robustness over standard Dropout, DropConnect,
and R-Drop baselines.
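
The sketch below illustrates how the three variants described in the abstract could be realized in PyTorch. It is not the authors' implementation: the tensor layout (batch, heads, queries, keys), the function names, and the hyperparameters k, p_drop, sigma, and radius are assumptions, and a fixed Gaussian width is used where the paper describes a dynamic kernel.

```python
import torch
import torch.nn.functional as F


def hard_attention_masking(logits, k=2, p_drop=0.1):
    """Randomly mask some of the top-k attention logits per query (illustrative).

    logits: (batch, heads, queries, keys) raw attention scores before softmax.
    Masked entries are set to -inf so the softmax renormalizes over the rest.
    """
    topk_idx = logits.topk(k, dim=-1).indices          # top-k keys per query
    topk_vals = logits.gather(-1, topk_idx)
    drop = torch.rand_like(topk_vals) < p_drop          # which top-k logits to mask
    masked = logits.clone()
    masked.scatter_(-1, topk_idx, topk_vals.masked_fill(drop, float("-inf")))
    return masked


def blurred_attention_smoothing(logits, sigma=1.0, radius=2):
    """Smooth each query's logits along the key axis with a 1D Gaussian kernel."""
    x = torch.arange(-radius, radius + 1, dtype=logits.dtype, device=logits.device)
    kernel = torch.exp(-(x ** 2) / (2 * sigma ** 2))
    kernel = (kernel / kernel.sum()).view(1, 1, -1)     # shape (1, 1, 2*radius+1)
    b, h, q, n_keys = logits.shape
    flat = logits.reshape(b * h * q, 1, n_keys)         # treat each query row as a 1D signal
    smoothed = F.conv1d(flat, kernel, padding=radius)
    return smoothed.reshape(b, h, q, n_keys)


def consistency_loss(log_probs_a, log_probs_b):
    """Symmetric KL between predictions from two independent stochastic passes."""
    kl_ab = F.kl_div(log_probs_a, log_probs_b, log_target=True, reduction="batchmean")
    kl_ba = F.kl_div(log_probs_b, log_probs_a, log_target=True, reduction="batchmean")
    return 0.5 * (kl_ab + kl_ba)


if __name__ == "__main__":
    scores = torch.randn(2, 4, 8, 8)                    # (batch, heads, queries, keys)
    attn_hard = torch.softmax(hard_attention_masking(scores), dim=-1)
    attn_blur = torch.softmax(blurred_attention_smoothing(scores), dim=-1)
    print(attn_hard.shape, attn_blur.shape)
```

In the consistency-regularized variant, training would run two independent stochastic forward passes of the same input and add the symmetric KL term above to the task loss, in the spirit of R-Drop.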