
Efficient Deep Visual and Inertial Odometry with Adaptive Visual Modality Selection (2205.06187v2)

Published 12 May 2022 in cs.CV

Abstract: In recent years, deep learning-based approaches for visual-inertial odometry (VIO) have shown remarkable performance outperforming traditional geometric methods. Yet, all existing methods use both the visual and inertial measurements for every pose estimation incurring potential computational redundancy. While visual data processing is much more expensive than that for the inertial measurement unit (IMU), it may not always contribute to improving the pose estimation accuracy. In this paper, we propose an adaptive deep-learning based VIO method that reduces computational redundancy by opportunistically disabling the visual modality. Specifically, we train a policy network that learns to deactivate the visual feature extractor on the fly based on the current motion state and IMU readings. A Gumbel-Softmax trick is adopted to train the policy network to make the decision process differentiable for end-to-end system training. The learned strategy is interpretable, and it shows scenario-dependent decision patterns for adaptive complexity reduction. Experiment results show that our method achieves a similar or even better performance than the full-modality baseline with up to 78.8% computational complexity reduction for KITTI dataset evaluation. The code is available at https://github.com/mingyuyng/Visual-Selective-VIO.

Citations (23)

Summary

  • The paper introduces a dynamic policy network that selectively deactivates the visual encoder based on motion state, achieving up to 78.8% reduction in computational load.
  • The paper employs the Gumbel-Softmax trick to maintain differentiability, enabling end-to-end training that balances efficiency with pose estimation accuracy.
  • The paper validates its method on the KITTI dataset, demonstrating competitive translational and rotational accuracy despite significant GFLOPS reduction.

Efficient Deep Visual and Inertial Odometry with Adaptive Visual Modality Selection

This paper presents a novel approach to enhancing the efficiency of deep learning-based visual-inertial odometry (VIO) systems through adaptive visual modality selection. The focus is on reducing the computational overhead inherent in existing VIO methods by strategically controlling the use of visual data processing, which typically imposes a higher computational burden than processing inertial measurement unit (IMU) data.

Approach and Methodology

The crux of the proposed method lies in its ability to dynamically deactivate the visual encoder during the pose estimation process without significantly compromising the accuracy of the results. This is achieved by employing a policy network that learns when visual data is essential for accurate pose estimation based on the current motion state and IMU readings. A Gumbel-Softmax trick is employed to maintain the differentiability of the decision process, enabling end-to-end training of the entire system.
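
As a rough illustration of this mechanism, the sketch below (PyTorch) shows how a small policy head could sample a hard yet differentiable use/skip decision from IMU-derived features via `torch.nn.functional.gumbel_softmax`. The module name `PolicyNet`, the layer sizes, and the way the visual feature is gated are illustrative assumptions for exposition, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Illustrative policy head deciding, per time step, whether to run the visual encoder.

    Input and hidden sizes are assumptions for this sketch, not the paper's exact design.
    """
    def __init__(self, imu_dim=256, hidden_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(imu_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # logits for {skip visual, use visual}
        )

    def forward(self, imu_feat, tau=1.0):
        logits = self.mlp(imu_feat)
        # Gumbel-Softmax: with hard=True the forward pass yields a one-hot decision
        # while gradients flow through the soft (straight-through) relaxation.
        decision = F.gumbel_softmax(logits, tau=tau, hard=True)
        use_visual = decision[..., 1]  # 1.0 -> run visual encoder, 0.0 -> skip it
        return use_visual

# Usage sketch: gate the expensive visual feature with the sampled decision.
# policy = PolicyNet()
# use_visual = policy(imu_feat)                          # shape (B,)
# visual_feat = visual_encoder(images) * use_visual.unsqueeze(-1)
```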

During the training phase, the policy network optimizes for a balance between computational efficiency and pose estimation accuracy by imposing a penalty factor, $\lambda$, on visual encoder usage. Across various configurations, the method achieves significant reductions in computational intensity as measured in GFLOPS, with reductions up to 78.8% compared to a full-modality baseline model.
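
A minimal sketch of how such an efficiency penalty could enter the training objective is shown below; the plain MSE pose term, the usage-rate form of the penalty, and the example value of `lam` are assumptions for illustration rather than the paper's exact loss.

```python
import torch

def vio_training_loss(pred_pose, gt_pose, use_visual, lam=3e-5):
    """Sketch of an efficiency-regularized objective (assumed form).

    pred_pose / gt_pose: predicted and ground-truth relative poses.
    use_visual: per-step policy decisions (1.0 = visual encoder was run).
    lam: penalty factor on visual encoder usage.
    """
    pose_loss = torch.mean((pred_pose - gt_pose) ** 2)   # pose estimation error
    usage_penalty = lam * use_visual.mean()              # cost of invoking the visual encoder
    return pose_loss + usage_penalty
```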

Experimental Results

The proposed model is subjected to rigorous evaluations on the KITTI Odometry dataset. With different penalties on the visual encoder's use, various configurations of the model are compared against a baseline where the visual modality is always active. The findings illustrate that with $\lambda = 3\times10^{-5}$, the model can cut down computational requirements by 78.8% while maintaining comparable accuracy in translational RMSE and achieving a noteworthy improvement in rotational RMSE.

Additionally, the method outperforms simpler alternative strategies such as regular skipping and random sampling policies. The proposed approach not only reduces computation without major accuracy trade-offs, but also performs competitively against leading conventional and learning-based VIO methods.

Policy Interpretation

The learned policy manifests as interpretable behavior, enabling insights into decision-making patterns across varied motion states. For instance, the paper shows a tendency for the visual modality to be more often invoked during high-speed, straight movement phases, whereas during slower, turning actions, reliance on purely inertial data becomes more prevalent. Such dynamic adaptability reflects the inherent ability to optimize computational resources without sacrificing accuracy.

Practical and Theoretical Implications

Practically, this method presents a significant advancement for resource-limited mobile platforms, offering an efficient way to manage computational demands and power use while sustaining reliable odometric performance. This advancement enables wider applicability of deep learning-based VIO systems in real-world scenarios, particularly in mobile robotics and autonomous navigation domains.

Theoretically, this work enriches the body of research on adaptive inference techniques across neural networks, extending their applicability beyond static computer vision tasks to more dynamic, real-world applications like VIO. This approach might inspire future research to further explore adaptive policies in complex multi-modal learning scenarios.

Conclusion and Future Directions

The introduction of an adaptive visual modality selection strategy marks a notable step toward efficient neural VIO systems. Future research could refine the adaptive policy network to further improve its flexibility, potentially integrating other sensory modalities for broader applicability across domains. Further work could also examine more nuanced factors influencing policy decisions to ensure robust, generalizable behavior across more diverse environments and tasks.