📄 Abstract
Sequence models such as transformers require inputs to be represented as
one-dimensional sequences. In vision, this typically involves flattening images
using a fixed row-major (raster-scan) order. While full self-attention is
permutation-equivariant, modern long-sequence transformers increasingly rely on
architectural approximations that break this invariance and introduce
sensitivity to patch ordering. We show that patch order significantly affects
model performance in such settings, with simple alternatives like column-major
or Hilbert curves yielding notable accuracy shifts. Motivated by this, we
propose REOrder, a two-stage framework for discovering task-optimal patch
orderings. First, we derive an information-theoretic prior by evaluating the
compressibility of various patch sequences. Then, we learn a policy over
permutations by optimizing a Plackett-Luce policy using REINFORCE. This
approach enables efficient learning in a combinatorial permutation space.
REOrder improves top-1 accuracy over row-major ordering by up to
3.01% on ImageNet-1K and by up to 13.35% on Functional Map of the World.
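The permutation-learning stage can be illustrated with a minimal sketch, assuming a PyTorch-style implementation; the class and function names (`PlackettLucePolicy`, `reinforce_step`) and hyperparameters are illustrative, not the authors' code.

```python
import torch

class PlackettLucePolicy(torch.nn.Module):
    """Plackett-Luce distribution over patch orderings, parameterized by per-patch scores."""

    def __init__(self, num_patches: int):
        super().__init__()
        # One learnable score per patch; higher score -> patch tends to appear earlier.
        self.scores = torch.nn.Parameter(torch.zeros(num_patches))

    def sample(self) -> torch.Tensor:
        # Sampling from Plackett-Luce = add Gumbel noise to the scores and sort
        # (the Gumbel-top-k trick); returns a permutation of patch indices.
        u = torch.rand_like(self.scores).clamp_(1e-9, 1.0 - 1e-9)
        gumbel = -torch.log(-torch.log(u))
        return torch.argsort(self.scores + gumbel, descending=True)

    def log_prob(self, perm: torch.Tensor) -> torch.Tensor:
        # Plackett-Luce log-likelihood:
        #   log P(perm) = sum_i [ s_{perm_i} - logsumexp(s_{perm_i}, ..., s_{perm_n}) ]
        s = self.scores[perm]
        # Reverse cumulative logsumexp gives the denominator over the remaining items.
        denom = torch.flip(torch.logcumsumexp(torch.flip(s, dims=[0]), dim=0), dims=[0])
        return (s - denom).sum()

def reinforce_step(policy, reward_fn, optimizer, baseline: float = 0.0):
    """One REINFORCE update: sample an ordering, score it, and push up its log-probability."""
    perm = policy.sample()
    reward = reward_fn(perm)  # e.g., negative task loss when patches are fed in this order
    loss = -(reward - baseline) * policy.log_prob(perm)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return perm, reward
```

The Plackett-Luce parameterization is what keeps the combinatorial space tractable: sampling and scoring a permutation each cost roughly O(n log n), so the policy never has to enumerate the n! possible orderings.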
Authors
Declan Kutscher
David M. Chan
Yutong Bai
Trevor Darrell
Ritwik Gupta
Key Contributions
Proposes REOrder, a framework that learns task-optimal patch orderings for vision transformers. It first derives an information-theoretic prior from the compressibility of candidate patch sequences, then optimizes a Plackett-Luce policy with REINFORCE to search the combinatorial permutation space efficiently, improving performance in long-sequence architectures whose approximations make them sensitive to a fixed patch order.
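A hedged sketch of how a compressibility-based prior over orderings could be computed; the helper names, the choice of zlib, and the row-/column-major comparison below are assumptions for illustration, and the paper's exact procedure may differ.

```python
import zlib
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an HxWxC image into a flat array of (patch x patch x C) patches."""
    h, w, c = image.shape
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch, patch, c)

def compressed_size(patches: np.ndarray, order: np.ndarray) -> int:
    """Bytes needed to zlib-compress the patch sequence laid out in the given order."""
    seq = patches[order].astype(np.uint8).tobytes()
    return len(zlib.compress(seq, 9))

def rank_orderings(image: np.ndarray, patch: int, orderings: dict) -> list:
    """Rank candidate orderings by compressibility (smallest compressed size first)."""
    patches = patchify(image, patch)
    sizes = {name: compressed_size(patches, order) for name, order in orderings.items()}
    return sorted(sizes.items(), key=lambda kv: kv[1])

# Example: compare row-major vs. column-major on a 224x224 image with 16x16 patches.
img = (np.random.rand(224, 224, 3) * 255).astype(np.uint8)
n_side = 224 // 16
row_major = np.arange(n_side * n_side)
col_major = np.arange(n_side * n_side).reshape(n_side, n_side).T.reshape(-1)
print(rank_orderings(img, 16, {"row-major": row_major, "column-major": col_major}))
```

Orderings that compress better keep statistically similar patches adjacent, which is the intuition behind using compressibility as a prior before any policy learning begins.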
Business Value
Enhances the performance of vision models, leading to more accurate and reliable applications in areas like autonomous driving, medical imaging analysis, and surveillance.