- The paper presents DEVO, a novel monocular event-only VO system that significantly reduces camera pose tracking error on real-world benchmarks.
- It employs a deep learning network to select and track informative event patches, refining optical flow predictions for improved accuracy.
- Evaluation across seven benchmarks shows DEVO generalizes from simulation to real-world conditions, rivaling methods that use additional sensors.
Introduction
Event cameras differ fundamentally from traditional cameras: instead of capturing full frames at a fixed rate, they report pixel-level brightness changes asynchronously, with high temporal resolution and high dynamic range. These attributes make them promising for tracking camera motion, known as visual odometry (VO), especially during high-speed motion or in challenging lighting conditions. However, existing event-based VO approaches, though robust in adverse conditions, still suffer from performance limitations and often rely on additional sensors such as inertial measurement units (IMUs), stereo setups, or frame-based cameras to achieve satisfactory accuracy. These additional inputs complicate the system, increase cost, and reintroduce frame-camera weaknesses such as motion blur and limited dynamic range.
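For background, here is a minimal sketch of the idealized per-pixel event model that such sensors implement and that simulators emulate: an event with polarity ±1 fires whenever the log-intensity at a pixel changes by a contrast threshold C. The frame-based sampling, the at-most-one-event-per-pixel-per-frame simplification, and the threshold value are illustrative, not DEVO's event generator:

```python
import numpy as np

def generate_events(log_frames, timestamps, C=0.2):
    """Idealized event generation (simplified to at most one event per
    pixel per frame): a pixel fires an event when its log-intensity
    changes by at least the contrast threshold C."""
    ref = log_frames[0].copy()              # reference log-intensity per pixel
    events = []                             # (x, y, t, polarity) tuples
    for log_img, t in zip(log_frames[1:], timestamps[1:]):
        diff = log_img - ref
        ys, xs = np.nonzero(np.abs(diff) >= C)
        for y, x in zip(ys, xs):
            events.append((x, y, t, 1 if diff[y, x] > 0 else -1))
        ref[ys, xs] = log_img[ys, xs]       # reset reference where events fired
    return events
```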
Towards Monocular Event-Only Visual Odometry
To address these challenges, this research introduces Deep Event Visual Odometry (DEVO), a monocular, event-only VO system designed to work robustly across diverse real-world benchmarks. DEVO sparsely selects and tracks event patches over time, using a deep patch selection network tailored to event data. This design lets DEVO substantially reduce pose tracking error on several real-world benchmarks, often rivaling or surpassing stereo or inertial methods without any additional sensors.
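The paper's exact selection network is not reproduced here; as a hedged sketch, assuming it outputs a per-pixel score map, selection could reduce to picking the k highest-scoring locations as patch centers. `select_patches`, `k`, and `patch_size` are illustrative names and values, not DEVO's:

```python
import torch

def select_patches(score_map, k=96, patch_size=3):
    """Pick the k highest-scoring locations from a learned score map
    of shape (B, 1, H, W) as patch centers. Hypothetical helper, not
    DEVO's exact selection rule."""
    B, _, H, W = score_map.shape
    flat = score_map.view(B, -1)
    _, idx = torch.topk(flat, k, dim=1)                   # top-k flat indices
    ys = torch.div(idx, W, rounding_mode="floor")         # recover row
    xs = idx % W                                          # recover column
    r = patch_size // 2                                   # keep windows in-bounds
    xs = xs.clamp(r, W - 1 - r)
    ys = ys.clamp(r, H - 1 - r)
    return torch.stack([xs, ys], dim=-1)                  # (B, k, 2) centers
```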
Deep Event Visual Odometry (DEVO)
DEVO tracks a sparse set of event patches over time, using a tailored neural network to predict which regions of the event data are most promising for accurate VO. The system estimates camera poses and patch depths from sequences of event data through an iterative process that refines optical flow predictions and adjusts the patch trajectories accordingly. A central component of DEVO is its patch selection network, designed specifically for the sparse, temporally rich nature of event camera data.
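DEVO's actual optimizer is a differentiable bundle adjustment over poses and depths; the toy loop below substitutes plain gradient descent to keep the sketch short, and `project` and `targets` are hypothetical stand-ins for the reprojection function and the network-predicted patch locations:

```python
import torch

def iterative_refinement(targets, project, poses, depths, steps=8, lr=1e-2):
    """Toy correct-and-optimize loop: adjust pose and depth parameters so
    reprojected patches match the predicted target locations. A sketch of
    the idea only; DEVO uses differentiable bundle adjustment, not SGD."""
    poses = poses.clone().requires_grad_(True)
    depths = depths.clone().requires_grad_(True)
    opt = torch.optim.SGD([poses, depths], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        residual = project(poses, depths) - targets   # reprojection error
        loss = (residual ** 2).mean()
        loss.backward()
        opt.step()
    return poses.detach(), depths.detach()
```

In the full system, the flow predictions themselves are also re-estimated each iteration, so refinement alternates between correcting the targets and re-solving for geometry.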
The network is trained on a large simulated dataset and evaluated on multiple real-world benchmarks, surpassing previous event-based methods in accuracy and robustness. To narrow the sim-to-real gap caused by the idealized event generation model used in simulation, DEVO additionally applies photometric voxel augmentations during training.
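Assuming the events are binned into the common voxel-grid representation, the following sketch shows both steps: building a voxel grid with linear temporal interpolation, then applying photometric-style augmentations (gain jitter, noise, hot pixels) that mimic real-sensor non-idealities absent from simulation. All parameter values are guesses for illustration, not the paper's settings:

```python
import numpy as np

def events_to_voxel(events, num_bins=5, H=480, W=640):
    """Accumulate events (x, y, t, p) in an (N, 4) array into a
    (num_bins, H, W) voxel grid, splitting each event's polarity
    between its two nearest temporal bins."""
    voxel = np.zeros((num_bins, H, W), dtype=np.float32)
    x, y, t, p = (events[:, i] for i in range(4))
    t = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    t0 = np.floor(t).astype(int)
    for dt in (0, 1):
        b = np.clip(t0 + dt, 0, num_bins - 1)
        w = np.clip(1.0 - np.abs(t - (t0 + dt)), 0.0, 1.0)
        np.add.at(voxel, (b, y.astype(int), x.astype(int)), p * w)
    return voxel

def photometric_augment(voxel, rng):
    """Illustrative photometric voxel augmentation: global gain jitter,
    additive noise, and a few hot pixels."""
    voxel = voxel * rng.uniform(0.7, 1.3)            # gain jitter
    voxel += rng.normal(0.0, 0.05, voxel.shape)      # sensor-like noise
    n_hot = rng.integers(0, 10)                      # simulated hot pixels
    ys = rng.integers(0, voxel.shape[1], n_hot)
    xs = rng.integers(0, voxel.shape[2], n_hot)
    voxel[:, ys, xs] += rng.uniform(2.0, 5.0, (voxel.shape[0], n_hot))
    return voxel.astype(np.float32)
```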
Evaluation and Open-Source Contributions
The evaluation of DEVO across seven real-world benchmarks highlights its capacity to generalize from simulation to a diverse array of real-world conditions without extensive parameter tuning. The evaluation reveals that DEVO often outperforms comparable methods that use additional sensors like IMUs or stereo cameras, demonstrating the efficacy of learning-based VO with event data.
To support the research community and encourage further advances in event-based vision, the authors have released their code, covering training, evaluation, and event data generation, as open source. This allows future research to build on, reproduce, and extend their approach to visual odometry.