
Causal Transformer for Fusion and Pose Estimation in Deep Visual Inertial Odometry (2409.08769v1)

Published 13 Sep 2024 in cs.CV

Abstract: In recent years, transformer-based architectures become the de facto standard for sequence modeling in deep learning frameworks. Inspired by the successful examples, we propose a causal visual-inertial fusion transformer (VIFT) for pose estimation in deep visual-inertial odometry. This study aims to improve pose estimation accuracy by leveraging the attention mechanisms in transformers, which better utilize historical data compared to the recurrent neural network (RNN) based methods seen in recent methods. Transformers typically require large-scale data for training. To address this issue, we utilize inductive biases for deep VIO networks. Since latent visual-inertial feature vectors encompass essential information for pose estimation, we employ transformers to refine pose estimates by updating latent vectors temporally. Our study also examines the impact of data imbalance and rotation learning methods in supervised end-to-end learning of visual inertial odometry by utilizing specialized gradients in backpropagation for the elements of SE$(3)$ group. The proposed method is end-to-end trainable and requires only a monocular camera and IMU during inference. Experimental results demonstrate that VIFT increases the accuracy of monocular VIO networks, achieving state-of-the-art results when compared to previous methods on the KITTI dataset. The code will be made available at https://github.com/ybkurt/VIFT.

Summary

  • The paper introduces VIFT, a novel transformer-based method that refines pose estimates by fusing visual and inertial data.
  • It leverages Riemannian manifold optimization to improve rotation accuracy, avoiding the pitfalls of conventional rotation representations such as Euler angles and quaternions.
  • Empirical results on the KITTI dataset demonstrate significant performance gains over conventional RNN-based approaches.

Causal Transformer for Fusion and Pose Estimation in Deep Visual Inertial Odometry

The paper "Causal Transformer for Fusion and Pose Estimation in Deep Visual Inertial Odometry" by Yunus Bilge Kurt et al. presents a novel approach leveraging transformer architectures to improve pose estimation in visual-inertial odometry (VIO) tasks. The proposed method, named the Visual-Inertial Fusion Transformer (VIFT), incorporates causal transformers to refine pose estimates by effectively utilizing historical data. This technique addresses the challenges posed by data imbalance and rotation learning while achieving state-of-the-art results on the KITTI dataset.

Background and Motivation

VIO methods estimate the pose of a moving body by fusing data from visual and inertial sensors. Traditional geometry-based VIO solutions achieve accurate results but require meticulous initialization and calibration. End-to-end learning approaches have emerged to bypass some of these constraints by directly learning to fuse the sensory information. Despite promising advances, existing deep VIO methods predominantly rely on RNN-based models for temporal dependencies, which may not fully exploit the temporal structure in the data. Moreover, rotation estimation continues to challenge deep learning frameworks because of the limitations of traditional rotation parameterizations such as Euler angles and quaternions.

Contributions

The key contributions of this paper are as follows:

  • Transformer-Based Fusion and Pose Estimation: VIFT replaces the conventional RNN-based modules with transformer-based architectures for better modeling of temporal dependencies. This enables the model to refine latent visual-inertial vectors using historical context, enhancing the accuracy of pose estimates.
  • Riemannian Manifold Optimization for Rotations: The paper utilizes Regularized Projective Manifold Gradient (RPMG) techniques for rotation regression. This approach optimizes rotations on their inherent manifold, addressing the issues presented by traditional representations and improving rotational accuracy.
  • Empirical Performance: Experimental results demonstrate that VIFT surpasses previous methods in accuracy on the KITTI dataset. The paper highlights that introducing transformers and RPMG significantly enhances deep VIO performance over RNN-based counterparts.

Methodology

VIFT Architecture

VIFT employs frozen image and inertial encoders to obtain latent vectors. These vectors are then processed by transformer layers that perform visual-inertial fusion and pose estimation, with causal masks ensuring that each estimate depends only on current and past measurements. The overall architecture, sketched in code after the list below, involves:

  • Visual and Inertial Encoders: Visual features are extracted using a FlowNet-based encoder, while inertial data is processed using a 1D CNN encoder.
  • Causal Transformer Layers: These layers refine the latent vectors by weighting them based on both current and previous measurements. This mechanism improves pose estimates by leveraging enhanced temporal context.
  • MLP for Pose Estimation: The output of transformer layers is fed into a two-layer MLP to estimate the relative pose between frames.

Loss Function

The authors employ a composite loss function combining translational and rotational errors. The incorporation of RPMG in the training process ensures that gradients for rotational parameters are computed in a manifold-aware manner, thereby enhancing the network's ability to learn accurate rotations.

Experimental Results

The empirical evaluation on the KITTI dataset shows that VIFT achieves superior performance compared to other state-of-the-art methods. Specifically, it provides significant reductions in relative translation and rotation errors, demonstrating the efficacy of transformer-based temporal modeling and manifold optimization techniques.

Ablation Study

Several ablation experiments verify the importance of different components:

  • Model Type: Comparison with MLP-based architectures highlights the transformer’s capability to leverage temporal dependencies more effectively.
  • Loss Function: The choice of L1 loss over L2 loss shows faster convergence and better performance.
  • Data Balancing: Adjusting for underrepresented rotational movements yields mixed results, necessitating careful tuning (one possible weighting scheme is sketched after this list).
  • Sequence Length: VIFT’s performance varies with sequence length, indicating a potential need for more training data or refined handling of longer sequences.

Implications and Future Work

The practical implications of VIFT are broad: it offers robust pose estimation using only a monocular camera and an IMU. The integration of transformers into VIO frameworks is a significant step toward more accurate and resilient models suited to diverse environmental conditions. The use of RPMG for rotation learning also sets a precedent for addressing similar challenges in other domains requiring manifold optimization.

Future research can explore several avenues: extending VIFT to stereo vision inputs, investigating alternative transformer architectures, and integrating additional sensor modalities. Moreover, refining data balancing techniques and enhancing the model's performance with very long sequences could further advance the field of VIO.

Conclusion

The paper successfully introduces an innovative causal transformer architecture for enhancing deep visual-inertial odometry. VIFT demonstrates significant improvements in pose estimation accuracy by effectively leveraging historical data through transformers and optimizing rotations on their inherent manifolds. These contributions mark a step forward in the evolution of deep learning-based VIO methods, presenting opportunities for further advancements and applications in autonomous navigation and robotics.
