Relational Epipolar Graphs for Robust Relative Camera Pose Estimation

Published 6 Apr 2026 in cs.CV and cs.RO | (2604.04554v1)

Abstract: A key component of Visual Simultaneous Localization and Mapping (VSLAM) is estimating relative camera poses using matched keypoints. Accurate estimation is challenged by noisy correspondences. Classical methods rely on stochastic hypothesis sampling and iterative estimation, while learning-based methods often lack explicit geometric structure. In this work, we reformulate relative pose estimation as a relational inference problem over epipolar correspondence graphs, where matched keypoints are nodes and nearby ones are connected by edges. Graph operations such as pruning, message passing, and pooling estimate a quaternion rotation, translation vector, and the Essential Matrix (EM). Minimizing a loss comprising (i) $\mathcal{L}_2$ differences with ground truth (GT), (ii) Frobenius norm between estimated and GT EMs, (iii) singular value differences, (iv) heading angle differences, and (v) scale differences, yields the relative pose between image pairs. The dense detector-free method LoFTR is used for matching. Experiments on indoor and outdoor benchmarks show improved robustness to dense noise and large baseline variation compared to classical and learning-guided approaches, highlighting the effectiveness of global relational consensus.


Authors (2)

Summary

The paper presents a novel relational formulation that constructs epipolar graphs from dense keypoint matches for robust relative camera pose estimation.
It leverages graph neural networks with message passing and pooling to regress 7D pose parameters while enforcing explicit epipolar constraints.
Experimental evaluations on benchmarks demonstrate enhanced accuracy, reduced drift, and better outlier suppression compared to traditional methods.


Introduction and Problem Statement

Relative camera pose estimation is fundamental to visual SLAM, multi-view geometry, and 3D scene reconstruction. Traditional pipelines, structured around minimal solvers and consensus schemes (e.g., RANSAC and its variants), face significant challenges under noise, wide baseline, and ambiguous correspondences, particularly in dense-matching regimes. Recent end-to-end deep models regress camera motion directly from image pairs, but often fail to enforce the explicit epipolar geometry or struggle to handle large outlier rates.

This paper, "Relational Epipolar Graphs for Robust Relative Camera Pose Estimation" (2604.04554), reframes the classical problem as global relational inference on epipolar correspondence graphs. Keypoint matches from a dense, detector-free matcher (LoFTR) are encoded as nodes in an augmented graph, and edges connect matches that are spatially and geometrically close, embedding the epipolar constraint in the graph structure. Graph neural network (GNN) layers then apply message passing and pooling to estimate the relative rotation (as a quaternion), the translation, and the essential matrix (EM), trained under a composite, geometry-consistent loss.

Methodology

Epipolar Graph Construction and Relational Representation

Given an image pair, LoFTR computes dense keypoint correspondences with associated confidence scores, and the keypoint coordinates are normalized using the camera intrinsics. Each matched point pair becomes a node in an initial $k$-NN graph (spatial proximity, not just image proximity). Edges are then pruned using Sampson-error thresholds computed against an initial minimal EM fit, so that the retained graph better satisfies the epipolar constraint.

This results in a correspondence graph $G=(V, E)$ whose node attributes are feature vectors encoding geometric and appearance properties. Edge construction (spatial, Sampson error, and possibly mutual/radius-based neighbor consistency as explored in ablations) is critical, directly impacting GNN aggregation and downstream pose performance.
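The construction described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the choice of nearest neighbors in the joint 4-D match space, the threshold `tau`, and the rule of dropping every edge that touches a high-error match are all hypothetical. The initial essential matrix `E_init` is assumed to come from some minimal solver and is simply passed in.

```python
import numpy as np

def sampson_error(E, x1, x2):
    """Sampson (first-order) epipolar error per match, for normalized 2-D points."""
    x1h = np.hstack([x1, np.ones((len(x1), 1))])   # homogeneous coordinates, (N, 3)
    x2h = np.hstack([x2, np.ones((len(x2), 1))])
    Ex1 = x1h @ E.T                                # row i is E @ x1_i
    Etx2 = x2h @ E                                 # row i is E^T @ x2_i
    num = np.sum(x2h * Ex1, axis=1) ** 2           # (x2^T E x1)^2
    den = Ex1[:, 0]**2 + Ex1[:, 1]**2 + Etx2[:, 0]**2 + Etx2[:, 1]**2
    return num / np.maximum(den, 1e-12)

def build_epipolar_graph(x1, x2, E_init, k=8, tau=1e-3):
    """Directed k-NN edges over matches, pruned by Sampson error (illustrative rule)."""
    feats = np.hstack([x1, x2])                    # assumption: node position = (u1, v1, u2, v2)
    d2 = np.sum((feats[:, None, :] - feats[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)                   # no self-loops
    k = min(k, len(feats) - 1)
    nbrs = np.argsort(d2, axis=1)[:, :k]           # k nearest neighbors per node
    src = np.repeat(np.arange(len(feats)), k)
    dst = nbrs.ravel()
    ok = sampson_error(E_init, x1, x2) < tau       # matches consistent with E_init
    keep = ok[src] & ok[dst]                       # assumption: drop edges touching a bad match
    return np.stack([src[keep], dst[keep]]), ok
```

The dense pairwise-distance matrix is quadratic in the number of matches, so a spatial index would be preferable for the match counts that a dense matcher typically produces.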

Figure 1: Epipolar Graph VO Block Diagram - Includes Graph Construction to Relative Pose Regression.

Pose Parameter Regression via GNNs

On the constructed graph, the pose inference pipeline consists of several stacked message-passing GNN layers (GCN, GAT, GIN, EdgeCNN), optionally with attention, followed by a pooling operation to produce a global graph feature vector. This feature is mapped via an MLP to the 7D pose space (quaternion, translation), from which the EM is reconstructed via $[\mathbf{t}]_\times\mathbf{R}(\mathbf{q})$.
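The last step, rebuilding the EM from the 7D output, is standard geometry and can be sketched as follows. The ordering of the pose vector (quaternion as `(w, x, y, z)` followed by the translation) and the function names are assumptions made for illustration; the source does not state the layout used by the network.

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix from a quaternion (w, x, y, z); q is normalized first."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]])

def essential_from_pose(pose7):
    """E = [t]_x R(q) from a 7-D pose vector (assumed layout: q then t)."""
    q, t = pose7[:4], pose7[4:]
    return skew(t) @ quat_to_rot(q)
```

A matrix built this way has two equal singular values (the norm of the translation) and one zero singular value, which is the property a singular-value loss on the EM can check against.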

Supervision is geometry-coupled and multi-objective: direct pose regression loss (translation direction, scale, quaternion with antipodal handling), EM Frobenius and SVD loss, and heading/yaw loss for trajectory consistency.
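A rough NumPy sketch of such a composite loss is given below for a single image pair. It is an illustration under assumptions rather than the paper's loss: the term weights, the term names, the use of squared differences, and the definition of heading as `atan2(t_x, t_z)` (yaw of the translation in a camera frame with z forward) are all hypothetical choices.

```python
import numpy as np

def composite_pose_loss(q_pred, t_pred, E_pred, q_gt, t_gt, E_gt,
                        weights=(1.0, 1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of pose and essential-matrix terms; returns (total, per-term dict)."""
    eps = 1e-12
    qp = q_pred / (np.linalg.norm(q_pred) + eps)
    qg = q_gt / (np.linalg.norm(q_gt) + eps)
    # q and -q encode the same rotation, so compare against the closer sign.
    rotation = min(np.sum((qp - qg) ** 2), np.sum((qp + qg) ** 2))

    n_pred, n_gt = np.linalg.norm(t_pred), np.linalg.norm(t_gt)
    direction = np.sum((t_pred / (n_pred + eps) - t_gt / (n_gt + eps)) ** 2)
    scale = (n_pred - n_gt) ** 2

    # Essential-matrix terms: entry-wise (Frobenius) and singular-value differences.
    frobenius = np.linalg.norm(E_pred - E_gt, ord="fro") ** 2
    sv_pred = np.linalg.svd(E_pred, compute_uv=False)
    sv_gt = np.linalg.svd(E_gt, compute_uv=False)
    singular = np.sum((sv_pred - sv_gt) ** 2)

    # Assumed heading definition: yaw of the translation, with the difference wrapped to [-pi, pi].
    yaw_pred = np.arctan2(t_pred[0], t_pred[2])
    yaw_gt = np.arctan2(t_gt[0], t_gt[2])
    dyaw = np.arctan2(np.sin(yaw_pred - yaw_gt), np.cos(yaw_pred - yaw_gt))
    heading = dyaw ** 2

    terms = {"rotation": rotation, "direction": direction, "scale": scale,
             "frobenius": frobenius, "singular": singular, "heading": heading}
    total = sum(w * v for w, v in zip(weights, terms.values()))
    return float(total), terms
```

Returning the individual terms alongside the total makes it easier to see which objective dominates during training, which matters when the weights are tuned per dataset.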

Experimental Results

Benchmarks and Data Protocol

Experiments are conducted on diverse benchmarks: KITTI (outdoor, autonomous driving), King's College (urban, relocalization), TartanAir (aerial, challenging motion), and ETH3D (indoor/outdoor). Both consecutive and wide-baseline frame selection (up to 20 frames apart, inducing significant viewpoint change and correspondence sparsity) are used to evaluate robustness.

Metrics cover both relative pose (DTE/DRE) and absolute trajectory (APE/ATE, APE-R), with a focus on translation/rotation error resilience to drift and outlier matches.
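For orientation, two widely used per-pair error measures are sketched below: the geodesic angle between rotations and the angle between translation directions. The exact definitions behind DTE and DRE are not given in this summary, so these functions (and their names) are assumptions that show the general style of metric, not necessarily the paper's.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle, in degrees, of the residual rotation R_pred^T R_gt."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def translation_direction_error_deg(t_pred, t_gt):
    """Angle, in degrees, between predicted and ground-truth translation directions."""
    denom = np.linalg.norm(t_pred) * np.linalg.norm(t_gt) + 1e-12
    cos = float(t_pred @ t_gt) / denom
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```

Clipping the cosine before `arccos` guards against values slightly outside [-1, 1] caused by floating-point rounding.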

Numerical Results and Performance Trends

On the TartanAir and King's College datasets, graph-based models (notably GAT+2GCN, 3GCN+GAT, and GIN-SumPool) achieve the lowest translation and rotation errors, outperforming PoseNet, RPNet, and other CNN-based regressors. The performance gap widens as the motion baseline and correspondence ambiguity increase.
On KITTI, for both consecutive and wide-baseline evaluations (5 and 10 frames apart), deep GNN architectures yield lower ATE/APE and DTE/DRE than image-based regressors, especially under moderate viewpoint changes. However, in sequences where epipolar inliers are extremely sparse because of a large baseline or repetitive structure, GNN performance also degrades, consistent with the ablation findings that reliable graph construction and edge selection are necessary.

Figure 2: Matched keypoints graph between frames $i$ and $i+10$ as a 2-D t-SNE plot.

Trajectory and Feature Embedding Analysis

3D trajectory visualizations confirm that GNN-based methods maintain motion consistency and reduced drift under both consecutive and wide-baseline regimes. In feature embedding analyses, t-SNE projections reveal that GNNs generate compact, well-clustered latent representations that align with geometrically consistent matches, whereas CNNs exhibit feature overlap and less separability.

Figure 3: t-SNE Plot of CNN Models.

Figure 4: t-SNE Plot of GNN models.

Runtime and Practical Implications

Graph-based inference is computationally efficient once correspondences have been extracted, with GIN-based aggregations being particularly lightweight. The overhead is dominated by graph construction, which currently runs on the CPU; further optimization (e.g., full GPU acceleration) could bring the pipeline to real-time performance suitable for SLAM applications.

Ablation Studies

Alternative neighbor definitions ($k$-NN, soft-$k$-NN, radius, mutual neighbors) slightly impact pose estimation accuracy, with some improvement in outlier suppression and geometric robustness for mutual and radius-based graphs. The optimal strategy remains data-dependent and tied to the correspondence regime.
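To make the difference between these neighbor definitions concrete, the sketch below builds boolean adjacency matrices for plain $k$-NN, mutual $k$-NN, and radius graphs over a set of node positions. The function names are hypothetical, and the soft-$k$-NN variant is left out because its definition is not given here.

```python
import numpy as np

def pairwise_sq_dists(P):
    """Squared Euclidean distances between all rows of P, shape (N, N)."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def knn_adjacency(P, k):
    """A[i, j] is True when j is one of the k nearest neighbors of i (directed)."""
    d2 = pairwise_sq_dists(P)
    np.fill_diagonal(d2, np.inf)                 # exclude self-loops
    idx = np.argsort(d2, axis=1)[:, :k]
    A = np.zeros(d2.shape, dtype=bool)
    A[np.arange(len(P))[:, None], idx] = True
    return A

def mutual_knn_adjacency(P, k):
    """Keep an edge only when each endpoint lists the other among its k nearest."""
    A = knn_adjacency(P, k)
    return A & A.T

def radius_adjacency(P, r):
    """Connect every pair of distinct nodes closer than or equal to r (symmetric)."""
    d2 = pairwise_sq_dists(P)
    np.fill_diagonal(d2, np.inf)
    return d2 <= r * r
```

Plain $k$-NN always gives every node $k$ outgoing edges, so an isolated outlier still attaches to the graph, whereas the mutual and radius variants can leave it disconnected. That is one plausible reading of why those variants help with outlier suppression.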

Feature visualizations before and after pooling empirically demonstrate the increased relational structure captured by GNNs versus CNNs, corroborating the observed reduction in pose estimation variance and drift.

Implications and Future Directions

This relational formulation unifies geometric consistency with deep relational modeling for robust pose inference. By embedding epipolar constraints directly in the graph structure and employing GNNs for global consensus, the approach avoids the brittle stochasticity of classical consensus pipelines (e.g., RANSAC) and addresses failure modes of black-box regression, particularly under wide-baseline or ambiguous matches.

Theoretical implications include an explicit bridging between multi-view geometry and spectral graph inference, where GNN message passing approximates nullspace selection for EM estimation—a perspective that opens new avenues for geometric deep learning in vision.

Practically, the method is well-suited for SLAM and visual odometry systems, especially in scenarios with dense matching and variable quality. With efficient graph construction and integration into correspondence pipelines, such systems can achieve high accuracy with low latency, robust to challenging environmental variations.

Directions for Advancement

Joint optimization of correspondence estimation and pose inference (end-to-end learning including the matcher).
Extension to multi-view pose graph optimization and more general structure-from-motion scenarios.
Efficient and scalable graph construction for real-time deployment, possibly with learned or adaptive edge pruning.
Incorporation of uncertainty modeling and temporal consistency for long-term SLAM robustness.

Conclusion

The paper establishes that casting relative pose estimation as relational inference over epipolar graphs—and leveraging GNN-based architectures—yields resilient, geometrically grounded pose predictions that outperform both consensus-based and purely image-based regression methods, particularly under correspondence noise and wide baseline conditions. The explicit encoding of geometry within graph learning provides theoretical clarity and practical robustness, setting the foundation for further advances in geometric deep learning for SLAM and 3D vision systems.
