SC-Net: Two-View Correspondence Network
- SC-Net is a deep learning architecture for two-view correspondence that integrates bilateral spatial and channel context to produce robust motion field estimates.
- It incorporates specialized modules including AFR, BFA, and PAR to enhance spatial localization, inlier robustness, and precise motion vector recovery.
- SC-Net's modular design achieves state-of-the-art performance in relative pose estimation and outlier removal, advancing robust 2D/3D matching in complex scenes.
SC-Net is a deep learning architecture for two-view correspondence learning that integrates bilateral context from both spatial and channel dimensions to produce robust, accurate motion fields in relative pose estimation and outlier removal tasks. SC-Net addresses limitations of standard convolutional neural network (CNN) backbones, which may insufficiently aggregate global context and oversmooth dense motion fields, especially in scenes with large disparity. The architecture introduces a sequence of specialized modules, most notably the Adaptive Focused Regularization (AFR) module, a Bilateral Field Adjustment (BFA) module, and a Position-Aware Recovery (PAR) module, each contributing to precise and context-aware motion field estimation (Lin et al., 29 Dec 2025).
1. Architectural Structure and Components
SC-Net comprises a stack of rectifying layers, each containing three major submodules: AFR, BFA, and PAR. The computational pipeline starts from “unordered” motion features, one feature vector per putative match, together with fixed grid embeddings that correspond to a regular spatial grid.
- Adaptive Focused Regularization (AFR): The first submodule within each rectifying layer, AFR transforms the motion features and grid embeddings into a sparse, position-sensitive motion field. It implements a multi-head graph attention mechanism, combining position-aware attention and soft filtering to enhance spatial localization and inlier robustness.
- Bilateral Field Adjustment (BFA): Refines the motion field by simultaneously modeling interactions across spatial and channel dimensions, capturing long-range dependencies and facilitating cross-context information exchange.
- Position-Aware Recovery (PAR): Recovers final motion vectors from the refined field, enforcing consistency and precision through explicit spatial referencing.
This modular structure enables each grid cell to selectively incorporate global and local context, supporting SC-Net’s efficacy in highly variable geometric configurations.
2. Adaptive Focused Regularization Module
The AFR module serves as the core innovation in SC-Net’s correspondence reasoning stack. For each rectifying layer:
- Input:
  - Motion features
  - Grid embeddings
  - Normalized keypoint coordinates
  - Previous-layer inlier logits
- Operation:
  - Graph Attention Backbone: Implements multi-head attention (4 heads in the reported configuration), where each head computes query, key, and value projections from the motion features and grid embeddings.
  - Position-Aware Bias: Constructs a positional correlation matrix from shared-MLP embeddings of the grid-cell positions and the normalized keypoint coordinates.
  - Augmented Attention Logits: Each attention head’s score is modulated by a positional bias term derived from this matrix, adding explicit spatial relationships (the term uses learnable parameters and a LeakyReLU nonlinearity).
  - Soft Filtering: Values are weighted by inlier probability, obtained by applying a sigmoid to the previous-layer inlier logits, reducing the influence of outliers.
- Output: Concatenated multi-head outputs yield the regularized motion field, ready for further refinement.
Together, the positional bias and soft filtering enhance position-awareness and spatial selectivity, directly counteracting the oversmoothing issues present in vanilla GAT and CNN-based approaches.
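The AFR steps listed above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the grid embeddings act as queries and the motion features as keys and values, that the positional bias is added to the scaled dot-product logits, and that the bias takes the form LeakyReLU(alpha * correlation + beta) with one hypothetical learnable pair (`alpha`, `beta`) per head. The function and parameter names (`afr_attention`, `init_params`, `grid_pos_emb`, `kp_pos_emb`) are invented for this example, and the positional embeddings are taken as precomputed inputs rather than produced by an MLP.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(rng, channels, num_heads):
    """Random projection weights plus per-head bias parameters (all hypothetical)."""
    scale = 1.0 / np.sqrt(channels)
    return {
        "Wq": rng.normal(0.0, scale, (channels, channels)),
        "Wk": rng.normal(0.0, scale, (channels, channels)),
        "Wv": rng.normal(0.0, scale, (channels, channels)),
        "alpha": np.ones(num_heads),   # assumed learnable scale of the positional bias
        "beta": np.zeros(num_heads),   # assumed learnable offset of the positional bias
    }

def split_heads(x, num_heads):
    """(n, C) -> (H, n, C // H)"""
    n, c = x.shape
    return x.reshape(n, num_heads, c // num_heads).transpose(1, 0, 2)

def afr_attention(grid_emb, motion_feat, grid_pos_emb, kp_pos_emb,
                  inlier_logits, params, num_heads=4):
    """grid_emb: (M, C), motion_feat: (N, C), grid_pos_emb: (M, P),
    kp_pos_emb: (N, P), inlier_logits: (N,). Returns an (M, C) motion field."""
    q = split_heads(grid_emb @ params["Wq"], num_heads)     # (H, M, d)
    k = split_heads(motion_feat @ params["Wk"], num_heads)  # (H, N, d)
    v = split_heads(motion_feat @ params["Wv"], num_heads)  # (H, N, d)
    d = q.shape[-1]

    # Positional correlation between grid cells and keypoints: (M, N)
    pos_corr = grid_pos_emb @ kp_pos_emb.T
    # Per-head positional bias, broadcast to (H, M, N)
    bias = leaky_relu(params["alpha"][:, None, None] * pos_corr
                      + params["beta"][:, None, None])

    # Scaled dot-product logits augmented with the positional bias
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d) + bias
    attn = softmax(logits, axis=-1)

    # Soft filtering: down-weight values of likely outliers
    v = v * sigmoid(inlier_logits)[None, :, None]

    out = attn @ v                                          # (H, M, d)
    return out.transpose(1, 0, 2).reshape(grid_emb.shape[0], -1)
```

In this sketch a match with a strongly negative inlier logit still takes part in the softmax, but its value vector is scaled toward zero, so it contributes almost nothing to any grid cell.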
3. Training Objective and Optimization
SC-Net employs a joint loss on each rectifying layer:
- Classification loss: Binary cross-entropy for correspondence classification, with an adaptive temperature.
- Essential-matrix loss: Regression loss aligning the predicted essential matrix with the ground truth.
- Loss-weight schedule: The weighting coefficient ramps after 20k training steps.
- Optimization: Adam optimizer with a preset initial learning rate.
- No auxiliary loss is assigned directly to AFR; improvements in loss are observed end-to-end through enhanced correspondence and motion estimation.
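A rough sketch of how such a per-layer joint loss could be assembled is given below. Several details are assumptions made only for illustration: the temperature is passed in as a plain number (how it is adapted is not specified here), the essential-matrix term uses a sign-invariant squared distance between normalized matrices (one common choice, not necessarily the paper's), the per-layer losses are simply summed, and the weight values `before=0.0` and `after=0.5` are hypothetical placeholders. Only the 20k-step switch point comes from the text.

```python
import numpy as np

def bce_with_logits(logits, labels, temperature=1.0):
    """Numerically stable binary cross-entropy on temperature-scaled logits."""
    z = logits / temperature
    return float(np.mean(np.maximum(z, 0) - z * labels + np.log1p(np.exp(-np.abs(z)))))

def essential_loss(E_pred, E_gt):
    """Sign- and scale-invariant squared distance between essential matrices (assumed form)."""
    e_p = E_pred.ravel() / np.linalg.norm(E_pred)
    e_g = E_gt.ravel() / np.linalg.norm(E_gt)
    return float(min(np.sum((e_p - e_g) ** 2), np.sum((e_p + e_g) ** 2)))

def ess_weight(step, switch_step=20_000, before=0.0, after=0.5):
    """Weight of the essential-matrix term; `before` and `after` are placeholders."""
    return after if step >= switch_step else before

def joint_loss(layer_logits, labels, layer_E, E_gt, step, temperature=1.0):
    """Sum of (classification + weighted regression) over all rectifying layers."""
    w = ess_weight(step)
    total = 0.0
    for logits, E_pred in zip(layer_logits, layer_E):
        total += bce_with_logits(logits, labels, temperature)
        total += w * essential_loss(E_pred, E_gt)
    return total
```

Because no term is attached to AFR alone, any benefit from AFR shows up in this sketch only through better logits and essential-matrix estimates in each layer.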
4. Hyper-Parameterization
Ablation analysis identifies the following settings as optimal:
| Parameter | Value / Range | Effect |
|---|---|---|
| Grid size | 16 (16 × 16 cells) | Balances spatial granularity and computational cost |
| Rectifying layers | 6 | Empirically optimal among the depths compared |
| Attention heads | 4 | Sets the per-head dimension |
| Position MLP | 2-layer, ReLU | For positional embedding of coordinates |
| LeakyReLU slope | 0.2 | Nonlinearity in positional bias |
Scaling of the attention logits and soft filtering via a sigmoid are critical choices for attention stability and inlier weighting, respectively.
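For reference, the tabulated settings could be collected in a small configuration object like the one below. The class name `SCNetConfig` and its field names are hypothetical, and the `num_grid_cells` property assumes a square grid; only the numeric values are taken from the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SCNetConfig:
    """Hypothetical container for the hyperparameters listed in the table."""
    grid_size: int = 16              # cells per side of the spatial grid
    num_rectifying_layers: int = 6   # depth of the rectifying stack
    num_attention_heads: int = 4     # heads in the AFR graph attention
    pos_mlp_layers: int = 2          # layers in the position MLP (ReLU activations)
    leaky_relu_slope: float = 0.2    # slope used in the positional bias

    @property
    def num_grid_cells(self) -> int:
        # Assumes a square grid of grid_size x grid_size cells
        return self.grid_size ** 2

config = SCNetConfig()
```

Freezing the dataclass keeps the reported settings immutable, so an ablation variant has to be created explicitly, for example with `dataclasses.replace(config, num_rectifying_layers=4)`.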
5. Empirical Results and Ablation
On the YFCC100M dataset (known scenes), SC-Net demonstrates state-of-the-art performance in correspondence classification. Scores for successive ablation configurations are:
- Baseline (ConvMatch, no AFR/BFA): 46.09
- + HED (hierarchical encoder-decoder, BFA only): 57.96
- + MFM (motion-feature modulator): 59.60
- + SF (soft filtering in AFR): 61.96
- + SF + PA (full AFR): 64.35
The ablation identifies two main sources of improvement in AFR:
- Soft Filtering (SF): +2.36 over the prior step (59.60 to 61.96).
- Position-Aware Attention (PA): An additional +2.39 (61.96 to 64.35).
Together, the two AFR components add +4.75 over the + MFM configuration, and the full model is +18.26 above the ConvMatch baseline. This highlights the importance of both spatially explicit attention mechanisms and inlier weighting for correspondence robustness.
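The step-by-step gains can be recomputed directly from the ablation scores listed above. The short script below is only an arithmetic check; the variable names are chosen for this example.

```python
# Ablation scores as listed above (YFCC100M, known scenes)
ablation = [
    ("Baseline (ConvMatch)", 46.09),
    ("+ HED", 57.96),
    ("+ MFM", 59.60),
    ("+ SF", 61.96),
    ("+ SF + PA", 64.35),
]

# Gain of each configuration over the one before it
step_gains = {
    name: round(score - prev_score, 2)
    for (_, prev_score), (name, score) in zip(ablation, ablation[1:])
}

# Combined contribution of the two AFR components, and total gain over the baseline
afr_gain = round(ablation[-1][1] - ablation[2][1], 2)
total_gain = round(ablation[-1][1] - ablation[0][1], 2)

for name, gain in step_gains.items():
    print(f"{name}: +{gain:.2f}")
print(f"AFR (SF + PA): +{afr_gain:.2f}, total over baseline: +{total_gain:.2f}")
```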
6. Design Rationale and Context within Correspondence Learning
Graph attention enables flexible information exchange between all matches and spatial locations but, in default form, is prone to spatial mixing and loss of locality. AFR’s position-aware bias enforces correspondence between spatial grid points and candidate matches based on geometric consistency, sharply localizing the attention. The soft filtering mechanism exploits intermediate classifier logits to suppress the effect of spurious, low-confidence motion samples, resulting in a more robust, outlier-tolerant estimate. By stacking these modules, SC-Net produces motion fields that retain high-frequency detail and spatial discontinuity, critical for challenging geometric scenes with large disparity or complex photometric differences.
The BFA and PAR modules further process the output of AFR, modeling joint spatial-channel context and enabling accurate recovery of motion vectors. The end-to-end design supports robust training and generalization without requiring dense supervision beyond keypoint correspondences and essential matrix annotation.
7. Applications and Extensions
SC-Net is benchmarked for relative pose estimation and outlier removal across large-scale correspondence datasets (YFCC100M, SUN3D). Its general design—bilateral spatial/channel context integration with explicit geometric localization—renders it suitable for problems in robotics, SLAM, structure from motion, and robust 2D/3D matching, especially where standard CNN or transformer methods struggle with global context aggregation and spatial discontinuity.
A plausible implication is that the AFR strategy—separately modeling positional and semantic relationships, together with confidence-weighted filtering—may be extensible to broader scene understanding or geometric reasoning tasks beyond correspondence learning, where spatial precision is crucial (Lin et al., 29 Dec 2025).