Consensus Learning with Deep Sets for Essential Matrix Estimation (2406.17414v2)

Published 25 Jun 2024 in cs.CV

Abstract: Robust estimation of the essential matrix, which encodes the relative position and orientation of two cameras, is a fundamental step in structure from motion pipelines. Recent deep-based methods achieved accurate estimation by using complex network architectures that involve graphs, attention layers, and hard pruning steps. Here, we propose a simpler network architecture based on Deep Sets. Given a collection of point matches extracted from two images, our method identifies outlier point matches and models the displacement noise in inlier matches. A weighted DLT module uses these predictions to regress the essential matrix. Our network achieves accurate recovery that is superior to existing networks with significantly more complex architectures.

Authors (6)
  1. Dror Moran (4 papers)
  2. Yuval Margalit (3 papers)
  3. Guy Trostianetsky (1 paper)
  4. Fadi Khatib (3 papers)
  5. Meirav Galun (27 papers)
  6. Ronen Basri (42 papers)

Summary

Consensus Learning with Deep Sets for Essential Matrix Estimation

Moran et al. address the challenging problem of estimating the essential matrix, a fundamental task in Structure from Motion (SfM) and Simultaneous Localization and Mapping (SLAM) pipelines. The essential matrix encapsulates the relative rotation and translation between two cameras and is pivotal for understanding the spatial relationship between the views they capture. Traditional approaches based on robust estimators such as RANSAC, although effective in many settings, often fall short when the correspondences contain high outlier ratios and substantial noise.
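For context, the classical RANSAC baseline referenced above can be run with OpenCV's stock estimator. The snippet below is a hedged illustration only: the camera intrinsics and the matched keypoints are placeholder values, not data or settings from the paper.

```python
# Illustrative classical baseline: RANSAC-based essential matrix estimation
# with OpenCV. Replace the placeholder intrinsics and matches with real data;
# the paper's learned method is proposed as an alternative to this estimator.
import cv2
import numpy as np

# N matched keypoints in pixel coordinates, shape (N, 2) each (placeholder data).
pts1 = np.random.rand(100, 2).astype(np.float64) * 640
pts2 = np.random.rand(100, 2).astype(np.float64) * 640

# Shared camera intrinsics (placeholder values).
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])

# RANSAC repeatedly fits E from minimal samples and keeps the model
# supported by the largest inlier consensus set.
E, inlier_mask = cv2.findEssentialMat(
    pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)

# Decompose E into a relative rotation R and translation direction t.
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inlier_mask)
print("inliers:", int(inlier_mask.sum()))
```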

Key Contributions

  1. NACNet: Noise Aware Consensus Network - Moran et al. introduce NACNet, a simplified yet effective neural network architecture for consensus learning.
  2. Deep Sets Based Architecture - Instead of relying on complex structures like graph networks or costly attention mechanisms, NACNet leverages the Deep Sets framework for achieving permutation invariance.
  3. Inlier Displacement Error Estimation - The network can predict the displacement noise of inlier keypoints, significantly enhancing robustness.
  4. Effective Noise-Free Pretraining Scheme - The authors suggest a pretraining strategy on denoised data before training on noisy data, improving the accuracy of the estimated essential matrix.

Methodology

The NACNet architecture is built from modular Noise Aware Consensus (NAC) blocks. Each NAC block contains a set encoder following the Deep Sets framework, alongside a noise regression module and a classification module. The network's parameters are updated via a multi-term loss function that combines classification error, model error, and noise prediction error (a minimal sketch of one block appears after the list below).

Detailed Blocks:

  • Set Encoder: Applies shared element-wise layers followed by a global aggregation, producing permutation-invariant features.
  • Noise Regression Module: Predicts positional noise to denoise inlier keypoints.
  • Classification Head: Differentiates between inliers and outliers, aiding accurate consensus set formation.
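To make the block structure concrete, here is a minimal PyTorch-style sketch of one NAC block (Deep Sets encoder, noise regression head, classification head) feeding a weighted eight-point DLT, the module the abstract describes as regressing the essential matrix. Layer widths, names, and the exact wiring are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a Deep Sets-style NAC block plus a
# weighted eight-point DLT. Layer sizes and wiring are illustrative assumptions.
import torch
import torch.nn as nn


class NACBlock(nn.Module):
    def __init__(self, in_dim=4, hidden=128):
        super().__init__()
        # Shared element-wise encoder applied independently to each correspondence.
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        # Processes the pooled, permutation-invariant global feature.
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        # Per-point heads conditioned on local + global context.
        self.noise_head = nn.Linear(2 * hidden, 2)  # predicted 2D displacement
        self.cls_head = nn.Linear(2 * hidden, 1)    # inlier/outlier logit

    def forward(self, matches):
        # matches: (B, N, 4) normalized correspondences (x1, y1, x2, y2)
        local = self.phi(matches)                          # (B, N, H)
        glob = self.rho(local.mean(dim=1, keepdim=True))   # (B, 1, H)
        feat = torch.cat([local, glob.expand_as(local)], dim=-1)
        noise = self.noise_head(feat)                      # (B, N, 2)
        weights = torch.sigmoid(self.cls_head(feat)).squeeze(-1)  # (B, N)
        return noise, weights


def weighted_dlt_essential(matches, weights):
    """Weighted eight-point DLT: least-squares fit of E from weighted matches."""
    x1, y1, x2, y2 = matches.unbind(dim=-1)
    ones = torch.ones_like(x1)
    # Each row encodes the epipolar constraint x2^T E x1 = 0.
    A = torch.stack([x2 * x1, x2 * y1, x2, y2 * x1, y2 * y1, y2,
                     x1, y1, ones], dim=-1)                # (B, N, 9)
    A = weights.unsqueeze(-1) * A
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vh = torch.linalg.svd(A, full_matrices=False)
    E = Vh[:, -1, :].reshape(-1, 3, 3)
    # Project onto the essential manifold: singular values (1, 1, 0).
    U, _, Vt = torch.linalg.svd(E)
    S = torch.diag_embed(E.new_tensor([1.0, 1.0, 0.0]).expand(E.shape[0], 3))
    return U @ S @ Vt


# Usage: predict per-point noise and weights, denoise, then regress E.
matches = torch.randn(2, 512, 4)          # placeholder normalized matches
noise, w = NACBlock()(matches)
denoised = torch.cat([matches[..., :2], matches[..., 2:] - noise], dim=-1)
E = weighted_dlt_essential(denoised, w)
print(E.shape)  # torch.Size([2, 3, 3])
```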

Training Strategy:

The training process follows a two-stage noise-aware optimization scheme: the network is first trained on denoised (noise-free) correspondences, which helps it learn robust features, and is then fine-tuned on noisy real-world data.
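A hedged sketch of how such a two-stage schedule might look in code, reusing the NACBlock sketch above; the data, loss terms, and epoch counts are placeholders rather than values from the paper.

```python
# Illustrative two-stage schedule (placeholder data, losses, and epoch counts):
# pretrain on noise-free matches, then fine-tune on noisy matches with the
# noise-prediction term enabled. Assumes NACBlock from the sketch above.
import torch

def stage(model, optimizer, matches, inlier_labels, noise_targets,
          use_noise_term, epochs):
    bce = torch.nn.BCELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        noise, weights = model(matches)
        loss = bce(weights, inlier_labels)                        # classification term
        if use_noise_term:
            loss = loss + (noise - noise_targets).pow(2).mean()   # noise term
        # (the paper's full loss also includes a model/essential-matrix term)
        loss.backward()
        optimizer.step()

model = NACBlock()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder batches standing in for real datasets.
clean = torch.randn(2, 512, 4)
noisy = clean + 0.01 * torch.randn_like(clean)
labels = torch.rand(2, 512).round()

# Stage 1: noise-free pretraining (inlier displacements are zero by construction).
stage(model, opt, clean, labels, torch.zeros(2, 512, 2), False, epochs=5)
# Stage 2: fine-tune on noisy data with the noise-prediction term.
stage(model, opt, noisy, labels, noisy[..., 2:] - clean[..., 2:], True, epochs=5)
```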

Numerical Results

NACNet demonstrates superior performance compared to contemporary methods. Specifically, in tests on the YFCC and SUN3D datasets, NACNet outperforms methods like CLNet, NCMNet, MGNet, and BCLNet across various keypoint descriptors (SIFT, SuperPoint). For instance:

  • On YFCC with SIFT, NACNet achieves a mAP of 66.14% in cross-scene evaluation, outperforming the best baseline by approximately 0.06%.
  • On SUN3D with SuperPoint, NACNet achieves a mAP of 24.67% in cross-scene testing, marking an improvement of 2.96% over competing techniques.

Theoretical and Practical Implications

Theoretically, the utilization of Deep Sets affirms the expressive capacity of permutation-equivariant functions for consensus tasks in computer vision. Practically, NACNet’s architecture simplifies the essential matrix estimation pipeline, ensuring it can handle substantial outlier ratios and noisy data with minimal computational overhead.

Future Directions

Future research could extend NACNet's applicability to varying conditions, including different keypoint detectors and more diverse environmental scenarios:

  • Cross-Descriptor Generalization: Although NACNet shows initial promise across different keypoint descriptors, more work is necessary to ensure consistency.
  • Multiview SfM Pipelines Integration: Incorporating NACNet into comprehensive multi-view SfM pipelines could enhance end-to-end performance.
  • Degeneracy Handling Mechanisms: Augmenting the model with degeneracy tests could steer estimation toward non-degenerate configurations, further improving the robustness of geometric model estimation.

In summary, Moran et al. propose a compelling, simpler approach to a notoriously difficult problem in computer vision, contributing to both theoretical understanding and practical advancements in essential matrix estimation. The introduction of NACNet paves the way for more resilient and efficient SfM and SLAM systems.