- The paper presents a novel relational formulation that constructs epipolar graphs from dense keypoint matches for robust relative camera pose estimation.
- It leverages graph neural networks with message passing and pooling to regress 7D pose parameters while enforcing explicit epipolar constraints.
- Experimental evaluations on benchmarks demonstrate enhanced accuracy, reduced drift, and better outlier suppression compared to traditional methods.
Relational Epipolar Graphs for Robust Relative Camera Pose Estimation
Introduction and Problem Statement
Relative camera pose estimation is fundamental to visual SLAM, multi-view geometry, and 3D scene reconstruction. Traditional pipelines, built around minimal solvers and consensus schemes (e.g., RANSAC and its variants), struggle under noise, wide baselines, and ambiguous correspondences, particularly in dense-matching regimes. Recent end-to-end deep models regress camera motion directly from image pairs, but often fail to enforce epipolar geometry explicitly or to cope with large outlier rates.
This paper, "Relational Epipolar Graphs for Robust Relative Camera Pose Estimation" (2604.04554), reframes the classical problem as global relational inference on epipolar correspondence graphs. Dense keypoint matches from a detector-free matcher (LoFTR) are encoded as nodes in an augmented graph. Edges encode spatial and geometric proximity, explicitly embedding the epipolar constraint. Graph neural network (GNN) architectures operate on this graph via message passing and pooling to estimate the relative rotation (as a quaternion), the translation, and the essential matrix (EM), all trained under composite, geometry-consistent loss functions.
Methodology
Epipolar Graph Construction and Relational Representation
Given an image pair, LoFTR computes dense keypoint correspondences with associated confidence scores, and the matched coordinates are normalized using the camera intrinsics. Each matched point pair becomes a node in an initial k-NN graph (built on spatial proximity, not just image proximity). Edges are then pruned using Sampson-error thresholds computed against an initial minimal EM fit, so that the retained graph better satisfies the epipolar constraint.
This results in a correspondence graph G=(V,E) whose node attributes are feature vectors encoding geometric and appearance properties. Edge construction (spatial, Sampson error, and possibly mutual/radius-based neighbor consistency as explored in ablations) is critical, directly impacting GNN aggregation and downstream pose performance.
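The construction steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the 4D node feature `(x1, y1, x2, y2)`, the values of `k` and `tau`, and the rule "keep an edge only if both endpoints fall below the Sampson threshold" are all hypothetical choices, and the initial essential matrix `E_init` is assumed to be supplied by some minimal solver.

```python
import numpy as np

def normalize_points(pts_px, K):
    """Map pixel coordinates (N, 2) to normalized camera coordinates via intrinsics K."""
    homog = np.hstack([pts_px, np.ones((pts_px.shape[0], 1))])  # (N, 3)
    return (np.linalg.inv(K) @ homog.T).T[:, :2]

def sampson_error(x1, x2, E):
    """First-order (Sampson) approximation of the epipolar error per correspondence."""
    ones = np.ones((x1.shape[0], 1))
    p1 = np.hstack([x1, ones])
    p2 = np.hstack([x2, ones])
    Ep1 = p1 @ E.T   # row i is E @ p1_i
    Etp2 = p2 @ E    # row i is E^T @ p2_i
    num = np.sum(p2 * Ep1, axis=1) ** 2
    den = Ep1[:, 0] ** 2 + Ep1[:, 1] ** 2 + Etp2[:, 0] ** 2 + Etp2[:, 1] ** 2
    return num / np.maximum(den, 1e-12)

def build_epipolar_graph(x1, x2, E_init, k=4, tau=1e-3):
    """k-NN graph over correspondences, pruned by Sampson error (assumed rule)."""
    nodes = np.hstack([x1, x2])  # hypothetical 4D node feature
    d = np.linalg.norm(nodes[:, None, :] - nodes[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # exclude self-neighbors
    nbrs = np.argsort(d, axis=1)[:, :k]
    err = sampson_error(x1, x2, E_init)
    keep = err < tau
    # Assumption: an edge survives only if both endpoints pass the threshold.
    edges = [(i, int(j)) for i in range(len(nodes)) for j in nbrs[i]
             if keep[i] and keep[j]]
    return nodes, np.array(edges, dtype=int).reshape(-1, 2), err
```

In a real pipeline the dense pairwise distance matrix would be replaced by a spatial index, since LoFTR can return thousands of matches per pair.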
Figure 1: Block diagram of the epipolar-graph VO pipeline, from graph construction to relative pose regression.
Pose Parameter Regression via GNNs
On the constructed graph, the pose inference pipeline consists of several stacked message-passing GNN layers (GCN, GAT, GIN, or EdgeCNN), optionally with attention, followed by a pooling operation that produces a global graph feature vector. An MLP maps this feature to the 7D pose space (a quaternion q and a translation t), from which the EM is reconstructed as $E = [t]_\times R(q)$, where $[t]_\times$ is the skew-symmetric cross-product matrix of t and R(q) is the rotation matrix of q.
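A rough NumPy sketch of this forward pass is given below, using a GIN-style sum aggregation, global mean pooling, a single linear head in place of the MLP, and the reconstruction of the essential matrix from the 7D output. The layer sizes, the untrained weights, the mean pooling, and the `(w, x, y, z)` quaternion ordering are assumptions made for illustration; the paper may use different choices.

```python
import numpy as np

def gin_layer(H, A, W1, W2, eps=0.0):
    """GIN-style update: MLP((1 + eps) * h_i + sum of neighbor features)."""
    agg = (1.0 + eps) * H + A @ H
    return np.maximum(agg @ W1, 0.0) @ W2

def regress_pose(H, A, layers, W_head):
    """Stacked message passing, mean pooling, then a linear head to 7D."""
    for W1, W2 in layers:
        H = np.maximum(gin_layer(H, A, W1, W2), 0.0)
    g = H.mean(axis=0)                       # global graph feature
    pose = g @ W_head                        # (7,) = quaternion + translation
    q = pose[:4] / max(np.linalg.norm(pose[:4]), 1e-12)
    return np.concatenate([q, pose[4:]])

def quat_to_rot(q):
    """Rotation matrix of a quaternion given in (w, x, y, z) order (assumed)."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]])

def essential_from_pose(pose7):
    """E = [t]_x R(q) from a 7D pose vector (q first, then t)."""
    return skew(pose7[4:]) @ quat_to_rot(pose7[:4])
```

A valid essential matrix built this way has two equal singular values and one zero singular value, which gives a quick sanity check on the reconstruction.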
Supervision is geometry-coupled and multi-objective: a direct pose regression loss (translation direction, translation scale, and quaternion with antipodal handling), essential-matrix Frobenius and SVD losses, and a heading/yaw loss for trajectory consistency.
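The summary does not give the exact form of these terms, so the following NumPy sketch shows one plausible version of each. Everything here is an assumption: the function names, the cosine form of the direction loss, the sign- and scale-invariant Frobenius comparison, the singular-value penalty used for the SVD term, and the weights. The heading/yaw term is omitted.

```python
import numpy as np

def quat_loss(q_pred, q_gt):
    """Squared distance between unit quaternions, treating q and -q as equal."""
    q_pred = q_pred / np.linalg.norm(q_pred)
    q_gt = q_gt / np.linalg.norm(q_gt)
    return float(min(np.sum((q_pred - q_gt) ** 2), np.sum((q_pred + q_gt) ** 2)))

def direction_loss(t_pred, t_gt, eps=1e-12):
    """1 - cosine similarity between translation directions."""
    denom = max(np.linalg.norm(t_pred) * np.linalg.norm(t_gt), eps)
    return float(1.0 - np.dot(t_pred, t_gt) / denom)

def scale_loss(t_pred, t_gt):
    """Absolute difference between translation magnitudes."""
    return float(abs(np.linalg.norm(t_pred) - np.linalg.norm(t_gt)))

def essential_frobenius_loss(E_pred, E_gt, eps=1e-12):
    """Frobenius distance after normalizing scale and resolving the sign."""
    A = E_pred / max(np.linalg.norm(E_pred), eps)
    B = E_gt / max(np.linalg.norm(E_gt), eps)
    return float(min(np.linalg.norm(A - B), np.linalg.norm(A + B)))

def essential_svd_loss(E, eps=1e-12):
    """Penalize deviation from the (s, s, 0) singular-value pattern (assumed form)."""
    s = np.linalg.svd(E / max(np.linalg.norm(E), eps), compute_uv=False)
    return float((s[0] - s[1]) ** 2 + s[2] ** 2)

def pose_loss(q_pred, t_pred, q_gt, t_gt, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the pose terms; the weights are hypothetical."""
    w_q, w_dir, w_scale = weights
    return (w_q * quat_loss(q_pred, q_gt)
            + w_dir * direction_loss(t_pred, t_gt)
            + w_scale * scale_loss(t_pred, t_gt))
```

The antipodal handling matters because q and -q describe the same rotation, so a naive squared difference would penalize a correct prediction with the opposite sign.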
Experimental Results
Benchmarks and Data Protocol
Experiments are conducted on diverse benchmarks: KITTI (outdoor, autonomous driving), King's College (urban, relocalization), TartanAir (aerial, challenging motion), and ETH3D (indoor/outdoor). Both consecutive and wide-baseline frame selection (up to 20 frames apart, inducing significant viewpoint change and correspondence sparsity) are used to evaluate robustness.
Metrics cover both relative pose (DTE/DRE) and absolute trajectory (APE/ATE, APE-R), with a focus on how well translation and rotation accuracy hold up under drift and outlier matches.
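For orientation, error measures of this kind are commonly computed as in the sketch below. The paper's exact definitions of DTE, DRE, APE, and ATE are not given in this summary, so these functions are generic stand-ins with hypothetical names: a geodesic rotation error, a Euclidean translation error, and a trajectory RMSE that assumes the two trajectories are already aligned.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle (degrees) between two rotation matrices."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def translation_error(t_pred, t_gt):
    """Euclidean distance between predicted and reference translations."""
    return float(np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt)))

def trajectory_rmse(traj_pred, traj_gt):
    """RMSE over per-frame position errors for (N, 3) trajectories (pre-aligned)."""
    diff = np.asarray(traj_pred) - np.asarray(traj_gt)
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))
```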
Trajectory and Feature Embedding Analysis
3D trajectory visualizations confirm that GNN-based methods maintain motion consistency and reduced drift under both consecutive and wide-baseline regimes. In feature embedding analyses, t-SNE projections reveal that GNNs generate compact, well-clustered latent representations that align with geometrically consistent matches, whereas CNNs exhibit feature overlap and less separability.
Figure 3: t-SNE plot of CNN models.
Figure 4: t-SNE plot of GNN models.
Runtime and Practical Implications
Graph-based inference is computationally efficient once correspondences have been extracted, with GIN-based aggregation being particularly lightweight. The overhead is dominated by graph construction, which currently runs on the CPU; further optimization (e.g., full GPU acceleration) could bring the pipeline to real-time performance suitable for SLAM applications.
Ablation Studies
Alternative neighbor definitions (k-NN, soft-k-NN, radius, mutual neighbors) have only a small effect on pose estimation accuracy, with mutual and radius-based graphs giving some improvement in outlier suppression and geometric robustness. The best strategy remains data-dependent and tied to the correspondence regime.
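The difference between three of these neighbor definitions can be illustrated with boolean adjacency matrices. This is a generic NumPy sketch with hypothetical function names, not the paper's code, and the soft-k-NN variant is left out because its weighting scheme is not described here.

```python
import numpy as np

def pairwise_dist(X):
    """Dense Euclidean distances with the diagonal masked out."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d

def knn_adjacency(X, k):
    """Directed k-NN: A[i, j] is True if j is among the k nearest points to i."""
    d = pairwise_dist(X)
    idx = np.argsort(d, axis=1)[:, :k]
    A = np.zeros(d.shape, dtype=bool)
    A[np.arange(len(X))[:, None], idx] = True
    return A

def mutual_knn_adjacency(X, k):
    """Keep an edge only if each endpoint lists the other among its k nearest."""
    A = knn_adjacency(X, k)
    return A & A.T

def radius_adjacency(X, r):
    """Connect every pair of points closer than or equal to r."""
    return pairwise_dist(X) <= r
```

Mutual k-NN is the strictest of the three: an isolated point still picks k neighbors under plain k-NN, but those neighbors rarely pick it back, so it ends up disconnected. That is one intuition for why mutual graphs can help with outlier suppression.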
Feature visualizations before and after pooling empirically demonstrate the increased relational structure captured by GNNs versus CNNs, corroborating the observed reduction in pose estimation variance and drift.
Implications and Future Directions
This relational formulation unifies geometric consistency with deep relational modeling for robust pose inference. By embedding epipolar constraints directly in the graph structure and employing GNNs for global consensus, the approach avoids the brittle stochasticity of classical consensus pipelines (e.g., RANSAC) and addresses failure modes of black-box regression, particularly under wide-baseline or ambiguous matches.
Theoretical implications include an explicit bridge between multi-view geometry and spectral graph inference, in which GNN message passing approximates nullspace selection for EM estimation. This perspective opens new avenues for geometric deep learning in vision.
Practically, the method is well-suited to SLAM and visual odometry systems, especially in scenarios with dense matching of variable quality. With efficient graph construction and integration into correspondence pipelines, such systems can achieve high accuracy at low latency while remaining robust to challenging environmental variations.
Directions for Advancement
- Joint optimization of correspondence estimation and pose inference (end-to-end learning including the matcher).
- Extension to multi-view pose graph optimization and more general structure-from-motion scenarios.
- Efficient and scalable graph construction for real-time deployment, possibly with learned or adaptive edge pruning.
- Incorporation of uncertainty modeling and temporal consistency for long-term SLAM robustness.
Conclusion
The paper establishes that casting relative pose estimation as relational inference over epipolar graphs, with GNN-based architectures operating on them, yields resilient, geometrically grounded pose predictions that outperform both consensus-based and purely image-based regression methods, particularly under correspondence noise and wide-baseline conditions. The explicit encoding of geometry within graph learning provides theoretical clarity and practical robustness, laying the foundation for further advances in geometric deep learning for SLAM and 3D vision systems.