- The paper presents SceneGlue, a transformer that integrates implicit attention and explicit visibility estimation for scene-aware feature matching without scene-level annotations.
- It introduces a Wave Position Encoder and parallel attention architecture to enhance multi-scale feature representation and improve matching accuracy in diverse vision tasks.
- Empirical results on HPatches, R1M, MegaDepth, and Aachen datasets demonstrate superior performance and efficiency over existing state-of-the-art methods.
Overview
SceneGlue addresses a central limitation of local feature matching in computer vision: the inherent locality of descriptors, which impedes robust cross-view correspondence. The framework leverages a hybrid paradigm combining implicit parallel attention mechanisms and explicit visibility estimation, thereby enabling scene-level awareness for feature matching tasks. Notably, SceneGlue does not require scene-level annotation or semantic supervision during training; instead, it is supervised solely by keypoint-level ground-truth matches. This enables superior generalization across diverse tasks such as homography estimation, pose estimation, image matching, and visual localization.
Methodological Innovations
SceneGlue introduces several methodological advances:
Informative Multi-Scale Feature Representation and Wave Position Encoding: The architecture utilizes a lightweight multi-scale feature network built on SuperPoint, sampling keypoints from multiple resolution feature maps to enhance stability under scale-varying conditions. For position encoding, SceneGlue replaces conventional MLP-based encoders with the Wave Position Encoder (Wave-PE), modeling descriptor-position relationships through amplitude and phase using the Euler formula. This approach demonstrably improves position-awareness in descriptors and outperforms MLP-based approaches given comparable parameter budgets.
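The PyTorch sketch below illustrates one plausible form of such a wave-style encoder. The conditioning choices (amplitude from the descriptor, phase from the keypoint position), layer layout, and dimensions are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class WavePositionEncoder(nn.Module):
    """Minimal sketch of a wave-style position encoder (assumed structure).
    Each keypoint is lifted to an amplitude A and a phase theta, and the
    encoding follows A * e^{i*theta} via Euler's formula, added to the descriptor."""

    def __init__(self, desc_dim: int = 256, pos_dim: int = 2):
        super().__init__()
        # One plausible factorization: amplitude conditioned on the descriptor,
        # phase conditioned on the keypoint position.
        self.amplitude = nn.Linear(desc_dim, desc_dim)
        self.phase = nn.Linear(pos_dim, desc_dim)

    def forward(self, desc: torch.Tensor, kpts: torch.Tensor) -> torch.Tensor:
        # desc: (B, N, desc_dim) local descriptors; kpts: (B, N, 2) normalized positions.
        amp = self.amplitude(desc)        # (B, N, D) amplitude term
        theta = self.phase(kpts)          # (B, N, D) phase term
        # Sum of the real and imaginary parts of amp * exp(i * theta),
        # folded back into D channels.
        wave = amp * torch.cos(theta) + amp * torch.sin(theta)
        return desc + wave                # position-aware descriptor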
Parallel Attention Architecture: Unlike prior works employing sequential self- and cross-attention (e.g., SuperGlue, LightGlue), SceneGlue arranges these attentions in parallel, enabling simultaneous intra- and inter-image interactions. This design promotes comprehensive context propagation and achieves greater representational fidelity for scene-aware features, while reducing redundant computations. Ablation results indicate performance gains in precision, recall, and F1-score when transitioning from serial to parallel attention.
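A minimal sketch of the parallel arrangement follows, assuming standard multi-head attention and a concatenation-based fusion; both are assumptions, since the paper's exact fusion scheme is not reproduced here.

```python
import torch
import torch.nn as nn

class ParallelAttentionBlock(nn.Module):
    """Illustrative parallel self- and cross-attention block (names and fusion
    are assumptions). Both branches read the same input state, so intra- and
    inter-image context are gathered simultaneously rather than sequentially."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor):
        # feats_a, feats_b: (B, N, dim) descriptors of the two images.
        def update(x, y):
            self_out, _ = self.self_attn(x, x, x)    # intra-image context
            cross_out, _ = self.cross_attn(x, y, y)  # inter-image context
            # Both branches attend from the same x; fuse and add residually.
            return x + self.fuse(torch.cat([self_out, cross_out], dim=-1))

        return update(feats_a, feats_b), update(feats_b, feats_a)
```

In a sequential design, the cross-attention branch would instead consume the output of the self-attention branch, which is what the parallel layout avoids.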
Visibility Transformer for Explicit Scene Awareness: SceneGlue incorporates a Visibility Transformer, which takes both learnable scene descriptors and multi-scale local descriptors as input and predicts commonly visible regions across image pairs. The module uses customized spatial and channel MLPs to enhance representation, followed by a Transformer-style mapping between scene descriptors and local features, ultimately yielding visibility classification via binary cross-entropy loss. This explicit modeling of cross-view visibility is shown to improve both interpretability and local matching robustness.
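The sketch below captures the overall data flow under stated assumptions: the number of scene tokens, the MLP layout, and the attention wiring are illustrative stand-ins for the paper's customized modules.

```python
import torch
import torch.nn as nn

class VisibilityHead(nn.Module):
    """Simplified sketch of a visibility module (dimensions, number of scene
    tokens, and MLP layout are assumptions). Learnable scene descriptors attend
    to local descriptors, and each keypoint receives a visibility logit
    supervised with binary cross-entropy."""

    def __init__(self, dim: int = 256, num_scene_tokens: int = 8, heads: int = 4):
        super().__init__()
        self.scene_tokens = nn.Parameter(torch.randn(1, num_scene_tokens, dim))
        # Channel MLP refining each descriptor; a stand-in for the paper's
        # customized spatial and channel MLPs.
        self.channel_mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.local_to_scene = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scene_to_local = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, 1)

    def forward(self, local_desc: torch.Tensor) -> torch.Tensor:
        # local_desc: (B, N, dim) multi-scale local descriptors of one image.
        x = local_desc + self.channel_mlp(local_desc)
        scene = self.scene_tokens.expand(x.shape[0], -1, -1)
        scene, _ = self.local_to_scene(scene, x, x)   # scene tokens summarize the image
        x, _ = self.scene_to_local(x, scene, scene)   # broadcast scene context back
        return self.classifier(x).squeeze(-1)         # (B, N) visibility logits

# Supervision: label 1 if a keypoint lies in the commonly visible region, else 0.
# loss = nn.functional.binary_cross_entropy_with_logits(logits, visibility_gt)
```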
Hybrid Loss Function: Training utilizes a hybrid loss, combining point-level correspondence estimation and scene-aware supervision. The design balances descriptor similarity and visibility estimation, and extensive hyper-parameter studies reveal optimal trade-offs that further boost performance.
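As a rough illustration, the hybrid objective can be written as a weighted sum of a point-level correspondence term and a visibility term. The negative-log-likelihood form of the matching loss, the helper name `hybrid_loss`, and the weight `lam` are hypothetical placeholders rather than values from the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(match_log_probs, gt_matches, vis_logits, vis_labels, lam=0.5):
    """Hedged sketch of a hybrid objective: point-level correspondence loss
    plus scene-aware visibility loss.

    match_log_probs: (B, M, N) log assignment scores between the two keypoint sets.
    gt_matches:      per-batch-item list of (i, j) ground-truth correspondence indices.
    vis_logits:      (B, K) predicted visibility logits.
    vis_labels:      (B, K) binary labels for commonly visible keypoints.
    """
    corr_terms = []
    for b, pairs in enumerate(gt_matches):
        for i, j in pairs:
            # Maximize the log-probability assigned to each ground-truth match.
            corr_terms.append(-match_log_probs[b, i, j])
    corr_loss = torch.stack(corr_terms).mean()

    vis_loss = F.binary_cross_entropy_with_logits(vis_logits, vis_labels.float())
    return corr_loss + lam * vis_loss
```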
Empirical Evaluation
SceneGlue's empirical results substantiate its claims of superior performance and efficiency:
Image Matching (HPatches): SceneGlue achieves the highest mean matching accuracy (MMA) under most thresholds, outperforming LightGlue, SuperGlue, and SAM in challenging scenarios involving viewpoint and illumination changes. Its precision at stringent thresholds affirms the efficacy of scene-level guidance in matching.
Homography Estimation (R1M): SceneGlue outperforms established baselines and Transformer-based methods, including SuperGlue, LightGlue, and SAM, with an F1-score of 95.97%. It delivers notable improvements (+2.20% precision, +1.08% F1-score) over sequential attention approaches and outlier filtering.
Outdoor Pose Estimation (MegaDepth, YFCC100M): SceneGlue achieves competitive or best AUC across multiple error thresholds, particularly surpassing SuperGlue and SAM on both datasets. It outperforms ClusterGNN at higher thresholds, indicating robustness under significant pose variation.
Indoor Pose Estimation (ScanNet, InLoc): SceneGlue attains superior pose AUC under tight thresholds, and demonstrates strong recall compared to LightGlue, DiffGlue, and SuperGlue in realistic indoor environments.
Visual Localization (Aachen Day-Night): The method achieves leading results under multiple error tolerances for both daytime and nighttime queries, validating its robustness to extreme appearance changes.
Efficiency: With 11.2M parameters, SceneGlue is more parameter-efficient than most comparator models, with competitive FLOPs and latency. It maintains high accuracy while reducing computational overhead relative to dense matchers (LoFTR, ASpanFormer).
Ablation Studies: Each component (Wave-PE, parallel attention, multi-scale features, visibility estimation) contributes independent performance gains, as validated by F1-score increments. SceneGlue's parameter efficiency is further highlighted in comparative studies with MLP-PE and varying network sizes.
Implications and Future Directions
The SceneGlue framework demonstrates that explicit and implicit scene-level awareness, achieved without reliance on scene-level semantic annotations, can substantially enhance feature matching across multiple downstream tasks. The methodological contributions in parallel attention and visibility estimation enable more accurate, robust, and interpretable matching, providing a foundation for future extensions.
From a practical standpoint, SceneGlue's improved robustness to viewpoint and illumination changes, as well as its efficiency, make it highly suitable for real-time SLAM, structure-from-motion, and large-scale localization tasks. The explicit visibility modeling has further implications for occlusion handling and dynamic scene reconstruction.
Theoretically, SceneGlue bridges local and global contexts in feature matching architectures, suggesting new avenues for integrating semantic segmentation, contextual reasoning, or multi-modal data. Incorporating semantic supervision or high-level scene understanding could enable even greater robustness under extreme conditions, as acknowledged by the authors.
Conclusion
SceneGlue delivers a scene-aware Transformer-based framework for local feature matching that achieves state-of-the-art performance and efficiency without requiring scene-level annotation. Its multi-scale informative representation, parallel attention design, and explicit visibility estimation collectively yield substantial improvements in accuracy, robustness, and interpretability for core vision tasks. Future research may further enhance its capabilities by integrating semantic cues and expanding its applicability to more complex and dynamic environments.