
SC-Net: Two-View Correspondence Network

Updated 5 January 2026
  • SC-Net is a deep learning architecture for two-view correspondence that integrates bilateral spatial and channel context to produce robust motion field estimates.
  • It incorporates specialized modules including AFR, BFA, and PAR to enhance spatial localization, inlier robustness, and precise motion vector recovery.
  • SC-Net's modular design achieves state-of-the-art performance in relative pose estimation and outlier removal, advancing robust 2D/3D matching in complex scenes.

SC-Net is a deep learning architecture for two-view correspondence learning that integrates bilateral context from both spatial and channel dimensions to produce robust, accurate motion fields in relative pose estimation and outlier removal tasks. SC-Net addresses limitations of standard convolutional neural network (CNN) backbones, which may insufficiently aggregate global context and oversmooth dense motion fields, especially in scenes with large disparity. The architecture introduces a sequence of specialized modules, most notably the Adaptive Focused Regularization (AFR) module, a Bilateral Field Adjustment (BFA) module, and a Position-Aware Recovery (PAR) module, each contributing to precise and context-aware motion field estimation (Lin et al., 29 Dec 2025).

1. Architectural Structure and Components

SC-Net comprises a stack of $L$ rectifying layers, each containing three major submodules: AFR, BFA, and PAR. The computational pipeline starts with “unordered” motion features $M^{l-1} \in \mathbb{R}^{N\times C}$, where $N$ is the number of putative matches and $C$ the feature dimension, together with fixed grid embeddings $G \in \mathbb{R}^{K^2\times C}$ corresponding to a $K\times K$ spatial grid.

  • Adaptive Focused Regularization (AFR): The initial submodule within each rectifying layer, AFR transforms $M^{l-1}$ and $G$ into a sparse, position-sensitive motion field $F^l \in \mathbb{R}^{K^2\times C}$. It implements a multi-head graph attention mechanism, combining position-aware attention and soft filtering to enhance spatial localization and inlier robustness.
  • Bilateral Field Adjustment (BFA): Refines the motion field $F^l$ by simultaneously modeling interactions across spatial and channel dimensions, capturing long-range dependencies and facilitating cross-context information exchange.
  • Position-Aware Recovery (PAR): Recovers final motion vectors from the refined field, enforcing consistency and precision through explicit spatial referencing.

This modular structure enables each grid cell to selectively incorporate global and local context, supporting SC-Net’s efficacy in highly variable geometric configurations.
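
As a shape-level illustration of this pipeline, the following PyTorch snippet lays out the tensors one rectifying layer consumes and produces. The sizes N and C are placeholder values chosen for the example (only K = 16 matches the paper's setting), and the AFR/BFA/PAR comments summarize the roles described above rather than actual implementations.

```python
import torch

# Illustrative tensor shapes for one rectifying layer (batch dimension omitted).
N, C, K = 2000, 128, 16            # putative matches, feature channels, grid side (K^2 = 256 cells)

M_prev = torch.randn(N, C)         # unordered motion features M^{l-1}
G      = torch.randn(K * K, C)     # fixed grid embeddings for the K x K spatial grid
X      = torch.rand(N, 2) * 2 - 1  # normalized keypoint coordinates of the putative matches
Y      = torch.rand(K * K, 2) * 2 - 1  # grid-cell coordinates
z_prev = torch.randn(N)            # previous-layer inlier logits

# AFR: (M_prev, G, X, Y, z_prev) -> sparse, position-sensitive motion field F^l of shape (K^2, C)
# BFA: refines F^l jointly over its spatial (K^2) and channel (C) dimensions, keeping the shape
# PAR: maps the refined field back to per-match motion vectors and inlier logits over the N matches
```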

2. Adaptive Focused Regularization Module

The AFR module serves as the core innovation in SC-Net’s correspondence reasoning stack. For each rectifying layer:

  • Input:
    • Motion features $M^{l-1} \in \mathbb{R}^{N\times C}$
    • Grid embeddings $G \in \mathbb{R}^{K^2\times C}$
    • Normalized keypoint coordinates $X \in \mathbb{R}^{N\times 2}$, $Y \in \mathbb{R}^{K^2\times 2}$
    • Previous-layer inlier logits $\hat z_{cls}^{l-1} \in \mathbb{R}^{N}$
  • Operation:
    • Graph Attention Backbone: Implements multi-head attention with $H$ heads, where each head computes query ($Q$), key ($K$), and value ($V$) projections from $G$ and $M^{l-1}$.
    • Position-Aware Bias: Constructs a positional correlation matrix $S = \Phi_Y \Phi_X^\top \in \mathbb{R}^{K^2\times N}$, where $\Phi_X$ and $\Phi_Y$ are shared MLP embeddings of $X$ and $Y$.
    • Augmented Attention Logits: Each attention head’s score is modulated by $B_i = \psi(\alpha_i\, S/\sqrt{C} + \beta_i)$, adding explicit spatial relationships (with learnable $\alpha_i$, $\beta_i$ and $\psi = $ LeakyReLU).
    • Soft Filtering: Values are weighted by the inlier probability $p = \sigma(\hat z_{cls})$ via the diagonal matrix $\hat Z = \mathrm{diag}(p)$, reducing the influence of outliers.
    • Output: Concatenated multi-head outputs yield $F^l = \mathrm{Concat}_i(O_i)\, W^O$, ready for further refinement.

This enhances position-awareness and spatial selectivity, directly counteracting the oversmoothing issues present in vanilla GAT and CNN-based approaches.
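
A minimal sketch of how these AFR operations could be composed, assuming standard PyTorch multi-head cross-attention with grid queries and match keys/values; the per-head softmax normalization by $\sqrt{C/H}$, the 2-layer position MLP, and the module/argument names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class AFRSketch(nn.Module):
    """Illustrative sketch of Adaptive Focused Regularization (not the authors' code)."""

    def __init__(self, C: int, heads: int = 4, leaky_slope: float = 0.2):
        super().__init__()
        assert C % heads == 0
        self.C, self.H, self.d = C, heads, C // heads
        self.q_proj = nn.Linear(C, C)   # queries from grid embeddings G
        self.k_proj = nn.Linear(C, C)   # keys from motion features M^{l-1}
        self.v_proj = nn.Linear(C, C)   # values from motion features M^{l-1}
        self.o_proj = nn.Linear(C, C)   # output projection W^O
        # Shared 2-layer MLP embedding 2-D coordinates into C channels (stand-in for the position MLP).
        self.pos_mlp = nn.Sequential(nn.Linear(2, C), nn.ReLU(), nn.Linear(C, C))
        # Learnable per-head scale/shift for the positional bias B_i.
        self.alpha = nn.Parameter(torch.ones(heads))
        self.beta = nn.Parameter(torch.zeros(heads))
        self.psi = nn.LeakyReLU(leaky_slope)

    def forward(self, M_prev, G, X, Y, z_cls_prev):
        # M_prev: (B, N, C), G: (B, K2, C), X: (B, N, 2), Y: (B, K2, 2), z_cls_prev: (B, N)
        B, N, _ = M_prev.shape
        K2 = G.shape[1]

        def split(t):  # (B, L, C) -> (B, H, L, d)
            return t.view(B, t.shape[1], self.H, self.d).transpose(1, 2)

        # Soft filtering: scale values by the inlier probability p = sigmoid(z_cls^{l-1}).
        p = torch.sigmoid(z_cls_prev).unsqueeze(-1)                     # (B, N, 1)
        Q = split(self.q_proj(G))                                       # (B, H, K2, d)
        K = split(self.k_proj(M_prev))                                  # (B, H, N, d)
        V = split(self.v_proj(M_prev) * p)                              # soft-filtered values

        # Position-aware bias: S = Phi_Y Phi_X^T, then B_i = psi(alpha_i * S / sqrt(C) + beta_i).
        S = self.pos_mlp(Y) @ self.pos_mlp(X).transpose(-1, -2)         # (B, K2, N)
        bias = self.psi(self.alpha.view(1, -1, 1, 1) * S.unsqueeze(1) / self.C ** 0.5
                        + self.beta.view(1, -1, 1, 1))                  # (B, H, K2, N)

        # Augmented attention logits: scaled dot-product scores plus the positional bias.
        attn = torch.softmax(Q @ K.transpose(-1, -2) / self.d ** 0.5 + bias, dim=-1)
        out = (attn @ V).transpose(1, 2).reshape(B, K2, self.C)         # concatenate heads
        return self.o_proj(out)                                         # F^l: (B, K2, C)
```

In this sketch the grid cells act as the query set, so each of the $K^2$ cells aggregates evidence from all $N$ matches, with the bias favoring geometrically consistent matches and the sigmoid weighting suppressing likely outliers.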

3. Training Objective and Optimization

SC-Net employs a joint loss on each rectifying layer:

$$\mathcal{L} = \sum_{l=0}^{L-1} \left[ \mathcal{L}_{cls}(\hat z_{cls}^{\,l}, z_{cls}) + \lambda\,\mathcal{L}_{reg}(\hat E^l, E) \right]$$

  • $\mathcal{L}_{cls}$: Binary cross-entropy for correspondence classification, with adaptive temperature $\tau$.
  • $\mathcal{L}_{reg}$: Regression loss aligning the predicted essential matrix $\hat E^l$ with the ground truth $E$.
  • $\lambda$ schedule: Ramps from $0$ to $0.5$ after 20k training steps.
  • Optimization: Adam optimizer with initial learning rate $10^{-4}$.
  • No auxiliary loss is assigned directly to AFR; improvements in loss are observed end-to-end through enhanced correspondence and motion estimation.
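
A minimal sketch of this objective under the stated schedule, assuming binary cross-entropy for the classification term and a plain mean-squared stand-in for the essential-matrix regression (the adaptive temperature $\tau$ and the paper's exact regression form are not reproduced here); the function name and arguments are hypothetical.

```python
import torch
import torch.nn.functional as F

def scnet_loss(z_cls_preds, z_cls_gt, E_preds, E_gt, step, ramp_steps=20_000):
    """Illustrative joint objective summed over the L rectifying layers.

    z_cls_preds: list of (B, N) inlier logits, one per layer
    z_cls_gt:    (B, N) binary inlier labels
    E_preds:     list of (B, 3, 3) predicted essential matrices, one per layer
    E_gt:        (B, 3, 3) ground-truth essential matrix
    """
    # lambda ramps from 0 to 0.5 once training passes `ramp_steps` steps.
    lam = 0.0 if step < ramp_steps else 0.5
    total = 0.0
    for z_hat, E_hat in zip(z_cls_preds, E_preds):
        cls_term = F.binary_cross_entropy_with_logits(z_hat, z_cls_gt.float())
        reg_term = F.mse_loss(E_hat, E_gt)  # stand-in for the essential-matrix regression loss
        total = total + cls_term + lam * reg_term
    return total
```

The optimizer would then be configured as, for example, `torch.optim.Adam(model.parameters(), lr=1e-4)`, matching the stated initial learning rate.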

4. Hyper-Parameterization

Ablative analysis identifies optimal settings as:

| Parameter | Value / Range | Effect |
| --- | --- | --- |
| Grid size $K$ | 16 ($K^2 = 256$ cells) | Balances spatial granularity and computational cost |
| Rectifying layers $L$ | 6 | Empirically optimal vs. $L = 4, 8$ |
| Attention heads $H$ | 4 | Per-head dimension $C/H$ |
| Position MLP $\mathcal{F}_3$ | 2-layer, ReLU, output dim $C$ | Positional embedding of coordinates |
| LeakyReLU slope | 0.2 | Nonlinearity in positional bias |

Scaling of $S$ by $1/\sqrt{C}$ and soft filtering via the sigmoid are critical tunings for attention stability and inlier weighting.
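
For reference, these settings can be collected into a small configuration object; the field names are illustrative, and the feature dimension C is an assumed value since the table above does not state it.

```python
from dataclasses import dataclass

@dataclass
class SCNetConfig:
    """Hyper-parameters from the ablation table (field names are illustrative)."""
    grid_size: int = 16        # K, giving K^2 = 256 grid cells
    num_layers: int = 6        # L rectifying layers (vs. 4 or 8 in ablations)
    num_heads: int = 4         # H attention heads, per-head dimension C // H
    channels: int = 128        # C; assumed value, not stated in the table
    leaky_slope: float = 0.2   # LeakyReLU slope in the positional bias
    pos_mlp_layers: int = 2    # shared 2-layer ReLU position MLP with output dim C
    lr: float = 1e-4           # Adam initial learning rate
    lambda_max: float = 0.5    # regression weight after the 20k-step ramp
```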

5. Empirical Results and Ablation

On the YFCC100M dataset (known scenes), SC-Net demonstrates state-of-the-art performance in correspondence classification (mAP@$5^\circ$):

  • Baseline (ConvMatch, no AFR/BFA): 46.09
  • + HED (hierarchical encoder-decoder, BFA only): 57.96
  • + MFM (motion-feature modulator): 59.60
  • + SF (soft filtering in AFR): 61.96
  • + SF + PA (full AFR): 64.35

The data identifies two main sources of improvement in AFR:

  • Soft Filtering (SF): +2.36 mAP over the prior step.
  • Position-Aware Attention (PA): an additional +2.39 mAP.

Collectively, these yield a cumulative advantage of more than 17 mAP over the unregularized baseline. This highlights the importance of both spatially explicit attention mechanisms and inlier weighting for correspondence robustness.

6. Design Rationale and Context within Correspondence Learning

Graph attention enables flexible information exchange between all matches and spatial locations but, in default form, is prone to spatial mixing and loss of locality. AFR’s position-aware bias enforces correspondence between spatial grid points and candidate matches based on geometric consistency, sharply localizing the attention. The soft filtering mechanism exploits intermediate classifier logits to suppress the effect of spurious, low-confidence motion samples, resulting in a more robust, outlier-tolerant estimate. By stacking these modules, SC-Net produces motion fields that retain high-frequency detail and spatial discontinuity, critical for challenging geometric scenes with large disparity or complex photometric differences.
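
One plausible way these ingredients combine within a single attention head $i$, written under the standard scaled-dot-product convention (the normalization inside the softmax is an assumption; the bias and the diagonal inlier weighting follow the descriptions above):

$$B_i = \psi\!\left(\alpha_i\,\frac{S}{\sqrt{C}} + \beta_i\right), \qquad O_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{C/H}} + B_i\right)\mathrm{diag}(p)\,V_i, \qquad F^l = \mathrm{Concat}_i(O_i)\,W^O$$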

The BFA and PAR modules further process the output of AFR, modeling joint spatial-channel context and enabling accurate recovery of motion vectors. The end-to-end design supports robust training and generalization without requiring dense supervision beyond keypoint correspondences and essential matrix annotation.

7. Applications and Extensions

SC-Net is benchmarked for relative pose estimation and outlier removal across large-scale correspondence datasets (YFCC100M, SUN3D). Its general design—bilateral spatial/channel context integration with explicit geometric localization—renders it suitable for problems in robotics, SLAM, structure from motion, and robust 2D/3D matching, especially where standard CNN or transformer methods struggle with global context aggregation and spatial discontinuity.

A plausible implication is that the AFR strategy—separately modeling positional and semantic relationships, together with confidence-weighted filtering—may be extensible to broader scene understanding or geometric reasoning tasks beyond correspondence learning, where spatial precision is crucial (Lin et al., 29 Dec 2025).
