Redirecting to original paper in 30 seconds...
Click below to go immediately or wait for automatic redirect
📄 Abstract
Abstract: Cross-view matching is fundamentally achieved through cross-attention
mechanisms. However, matching of high-resolution images remains challenging due
to the quadratic complexity and lack of explicit matching constraints in the
existing cross-attention. This paper proposes an attention mechanism,
MatchAttention, that dynamically matches relative positions. The relative
position determines the attention sampling center of the key-value pairs given
a query. Continuous and differentiable sliding-window attention sampling is
achieved by the proposed BilinearSoftmax. The relative positions are
iteratively updated through residual connections across layers by embedding
them into the feature channels. Since the relative position is exactly the
learning target for cross-view matching, an efficient hierarchical cross-view
decoder, MatchDecoder, is designed with MatchAttention as its core component.
To handle cross-view occlusions, gated cross-MatchAttention and a
consistency-constrained loss are proposed. These two components collectively
mitigate the impact of occlusions in both forward and backward passes, allowing
the model to focus more on learning matching relationships. When applied to
stereo matching, MatchStereo-B ranked 1st in average error on the public
Middlebury benchmark and requires only 29ms for KITTI-resolution inference.
MatchStereo-T can process 4K UHD images in 0.1 seconds using only 3GB of GPU
memory. The proposed models also achieve state-of-the-art performance on KITTI
2012, KITTI 2015, ETH3D, and Spring flow datasets. The combination of high
accuracy and low computational complexity makes real-time, high-resolution, and
high-accuracy cross-view matching possible. Code is available at
https://github.com/TingmanYan/MatchAttention.
Authors (5)
Tingman Yan
Tao Liu
Xilian Yang
Qunfei Zhao
Zeyang Xia
Submitted
October 16, 2025
Key Contributions
Proposes MatchAttention, a novel attention mechanism for high-resolution cross-view matching that dynamically matches relative positions, overcoming the quadratic complexity of standard cross-attention. It introduces BilinearSoftmax for sampling and a hierarchical MatchDecoder, along with techniques to handle occlusions.
Business Value
Enables more accurate and efficient 3D reconstruction and scene understanding from high-resolution imagery, crucial for applications like autonomous navigation, robotics, and detailed digital twins.