- The paper presents SceneGlue, a transformer that integrates implicit attention and explicit visibility estimation for scene-aware feature matching without scene-level annotations.
- It introduces a Wave Position Encoder and parallel attention architecture to enhance multi-scale feature representation and improve matching accuracy in diverse vision tasks.
- Empirical results on HPatches, R1M, MegaDepth, and Aachen datasets demonstrate superior performance and efficiency over existing state-of-the-art methods.
Overview
SceneGlue addresses a central limitation of local feature matching in computer vision: the inherent locality of descriptors, which impedes robust cross-view correspondence. The framework leverages a hybrid paradigm combining implicit parallel attention mechanisms and explicit visibility estimation, thereby enabling scene-level awareness for feature matching tasks. Notably, SceneGlue does not require scene-level annotation or semantic supervision during training; instead, it is supervised solely by keypoint-level ground-truth matches. This enables superior generalization across diverse tasks such as homography estimation, pose estimation, image matching, and visual localization.
Methodological Innovations
SceneGlue introduces several methodological advances:
Informative Multi-Scale Feature Representation and Wave Position Encoding: The architecture utilizes a lightweight multi-scale feature network built on SuperPoint, sampling keypoints from multiple resolution feature maps to enhance stability under scale-varying conditions. For position encoding, SceneGlue replaces conventional MLP-based encoders with the Wave Position Encoder (Wave-PE), modeling descriptor-position relationships through amplitude and phase using the Euler formula. This approach demonstrably improves position-awareness in descriptors and outperforms MLP-based approaches given comparable parameter budgets.
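The PyTorch sketch below illustrates one plausible form of such a wave-style encoder. The conditioning choices (amplitude from the descriptor, phase from the keypoint position), layer layout, and dimensions are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class WavePositionEncoder(nn.Module):
    """Minimal sketch of a wave-style position encoder (assumed structure).
    Each keypoint is lifted to an amplitude A and a phase theta, and the
    encoding follows A * e^{i*theta} via Euler's formula, added to the descriptor."""

    def __init__(self, desc_dim: int = 256, pos_dim: int = 2):
        super().__init__()
        # One plausible factorization: amplitude conditioned on the descriptor,
        # phase conditioned on the keypoint position.
        self.amplitude = nn.Linear(desc_dim, desc_dim)
        self.phase = nn.Linear(pos_dim, desc_dim)

    def forward(self, desc: torch.Tensor, kpts: torch.Tensor) -> torch.Tensor:
        # desc: (B, N, desc_dim) local descriptors; kpts: (B, N, 2) normalized positions.
        amp = self.amplitude(desc)        # (B, N, D) amplitude term
        theta = self.phase(kpts)          # (B, N, D) phase term
        # Sum of the real and imaginary parts of amp * exp(i * theta),
        # folded back into D channels.
        wave = amp * torch.cos(theta) + amp * torch.sin(theta)
        return desc + wave                # position-aware descriptor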
Parallel Attention Architecture: Unlike prior works employing sequential self- and cross-attention (e.g., SuperGlue, LightGlue), SceneGlue arranges these attentions in parallel, enabling simultaneous intra- and inter-image interactions. This design promotes comprehensive context propagation and achieves greater representational fidelity for scene-aware features, while reducing redundant computations. Ablation results indicate performance gains in precision, recall, and F1-score when transitioning from serial to parallel attention.
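A minimal sketch of the parallel arrangement follows, assuming standard multi-head attention and a concatenation-based fusion; both are assumptions, since the paper's exact fusion scheme is not reproduced here.

```python
import torch
import torch.nn as nn

class ParallelAttentionBlock(nn.Module):
    """Illustrative parallel self- and cross-attention block (names and fusion
    are assumptions). Both branches read the same input state, so intra- and
    inter-image context are gathered simultaneously rather than sequentially."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor):
        # feats_a, feats_b: (B, N, dim) descriptors of the two images.
        def update(x, y):
            self_out, _ = self.self_attn(x, x, x)    # intra-image context
            cross_out, _ = self.cross_attn(x, y, y)  # inter-image context
            # Both branches attend from the same x; fuse and add residually.
            return x + self.fuse(torch.cat([self_out, cross_out], dim=-1))

        return update(feats_a, feats_b), update(feats_b, feats_a)
```

In a sequential design, the cross-attention branch would instead consume the output of the self-attention branch, which is what the parallel layout avoids.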
Visibility Transformer for Explicit Scene Awareness: SceneGlue incorporates a Visibility Transformer, which takes both learnable scene descriptors and multi-scale local descriptors as input and predicts commonly visible regions across image pairs. The module uses customized spatial and channel MLPs to enhance representation, followed by a Transformer-style mapping between scene descriptors and local features, ultimately yielding visibility classification via binary cross-entropy loss. This explicit modeling of cross-view visibility is shown to improve both interpretability and local matching robustness.
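The sketch below captures the overall data flow under stated assumptions: the number of scene tokens, the MLP layout, and the attention wiring are illustrative stand-ins for the paper's customized modules.

```python
import torch
import torch.nn as nn

class VisibilityHead(nn.Module):
    """Simplified sketch of a visibility module (dimensions, number of scene
    tokens, and MLP layout are assumptions). Learnable scene descriptors attend
    to local descriptors, and each keypoint receives a visibility logit
    supervised with binary cross-entropy."""

    def __init__(self, dim: int = 256, num_scene_tokens: int = 8, heads: int = 4):
        super().__init__()
        self.scene_tokens = nn.Parameter(torch.randn(1, num_scene_tokens, dim))
        # Channel MLP refining each descriptor; a stand-in for the paper's
        # customized spatial and channel MLPs.
        self.channel_mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.local_to_scene = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scene_to_local = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, 1)

    def forward(self, local_desc: torch.Tensor) -> torch.Tensor:
        # local_desc: (B, N, dim) multi-scale local descriptors of one image.
        x = local_desc + self.channel_mlp(local_desc)
        scene = self.scene_tokens.expand(x.shape[0], -1, -1)
        scene, _ = self.local_to_scene(scene, x, x)   # scene tokens summarize the image
        x, _ = self.scene_to_local(x, scene, scene)   # broadcast scene context back
        return self.classifier(x).squeeze(-1)         # (B, N) visibility logits

# Supervision: label 1 if a keypoint lies in the commonly visible region, else 0.
# loss = nn.functional.binary_cross_entropy_with_logits(logits, visibility_gt)
```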
Hybrid Loss Function: Training utilizes a hybrid loss, combining point-level correspondence estimation and scene-aware supervision. The design balances descriptor similarity and visibility estimation, and extensive hyper-parameter studies reveal optimal trade-offs that further boost performance.
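As a rough illustration, the hybrid objective can be written as a weighted sum of a point-level correspondence term and a visibility term. The negative-log-likelihood form of the matching loss, the helper name `hybrid_loss`, and the weight `lam` are hypothetical placeholders rather than values from the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(match_log_probs, gt_matches, vis_logits, vis_labels, lam=0.5):
    """Hedged sketch of a hybrid objective: point-level correspondence loss
    plus scene-aware visibility loss.

    match_log_probs: (B, M, N) log assignment scores between the two keypoint sets.
    gt_matches:      per-batch-item list of (i, j) ground-truth correspondence indices.
    vis_logits:      (B, K) predicted visibility logits.
    vis_labels:      (B, K) binary labels for commonly visible keypoints.
    """
    corr_terms = []
    for b, pairs in enumerate(gt_matches):
        for i, j in pairs:
            # Maximize the log-probability assigned to each ground-truth match.
            corr_terms.append(-match_log_probs[b, i, j])
    corr_loss = torch.stack(corr_terms).mean()

    vis_loss = F.binary_cross_entropy_with_logits(vis_logits, vis_labels.float())
    return corr_loss + lam * vis_loss
```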
Empirical Evaluation
SceneGlue's empirical results substantiate its claims of superior performance and efficiency:
Image Matching (HPatches): SceneGlue achieves the highest mean matching accuracy (MMA) under most thresholds, outperforming LightGlue, SuperGlue, and SAM in challenging scenarios involving viewpoint and illumination changes. Its precision at stringent thresholds affirms the efficacy of scene-level guidance in matching.
Homography Estimation (R1M): SceneGlue outperforms established baselines and Transformer-based methods, including SuperGlue, LightGlue, and SAM, with an F1-score of 95.97%. It delivers notable improvements (+2.20% precision, +1.08% F1-score) over sequential attention approaches and outlier filtering.
Outdoor Pose Estimation (MegaDepth, YFCC100M): SceneGlue achieves competitive or best AUC across multiple error thresholds, particularly surpassing SuperGlue and SAM on both datasets. It outperforms ClusterGNN at higher thresholds, indicating robustness under significant pose variation.
Indoor Pose Estimation (ScanNet, InLoc): SceneGlue attains superior pose AUC under tight thresholds, and demonstrates strong recall compared to LightGlue, DiffGlue, and SuperGlue in realistic indoor environments.
Visual Localization (Aachen Day-Night): The method achieves leading results under multiple error tolerances for both daytime and nighttime queries, validating its robustness to extreme appearance changes.
Efficiency: With 11.2M parameters, SceneGlue is more parameter-efficient than most comparator models, with competitive FLOPs and latency. It maintains high accuracy while reducing computational overhead relative to dense matchers (LoFTR, ASpanFormer).
Ablation Studies: Each component (Wave-PE, parallel attention, multi-scale features, visibility estimation) contributes independent performance gains, as validated by F1-score increments. SceneGlue's parameter efficiency is further highlighted in comparative studies with MLP-PE and varying network sizes.
Implications and Future Directions
The SceneGlue framework demonstrates that explicit and implicit scene-level awareness, achieved without reliance on scene-level semantic annotations, can substantially enhance feature matching across multiple downstream tasks. The methodological contributions in parallel attention and visibility estimation enable more accurate, robust, and interpretable matching, providing a foundation for future extensions.
From a practical standpoint, SceneGlue's improved robustness to viewpoint and illumination changes, as well as its efficiency, make it highly suitable for real-time SLAM, structure-from-motion, and large-scale localization tasks. The explicit visibility modeling has further implications for occlusion handling and dynamic scene reconstruction.
Theoretically, SceneGlue bridges local and global contexts in feature matching architectures, suggesting new avenues for integrating semantic segmentation, contextual reasoning, or multi-modal data. Incorporating semantic supervision or high-level scene understanding could enable even greater robustness under extreme conditions, as acknowledged by the authors.
Conclusion
SceneGlue delivers a scene-aware Transformer-based framework for local feature matching that achieves state-of-the-art performance and efficiency without requiring scene-level annotation. Its multi-scale informative representation, parallel attention design, and explicit visibility estimation collectively yield substantial improvements in accuracy, robustness, and interpretability for core vision tasks. Future research may further enhance its capabilities by integrating semantic cues and expanding its applicability to more complex and dynamic environments.