
Self-supervised Pretraining and Finetuning for Monocular Depth and Visual Odometry (2406.11019v1)

Published 16 Jun 2024 in cs.CV

Abstract: For the task of simultaneous monocular depth and visual odometry estimation, we propose learning self-supervised transformer-based models in two steps. Our first step consists in a generic pretraining to learn 3D geometry, using cross-view completion objective (CroCo), followed by self-supervised finetuning on non-annotated videos. We show that our self-supervised models can reach state-of-the-art performance 'without bells and whistles' using standard components such as visual transformers, dense prediction transformers and adapters. We demonstrate the effectiveness of our proposed method by running evaluations on six benchmark datasets, both static and dynamic, indoor and outdoor, with synthetic and real images. For all datasets, our method outperforms state-of-the-art methods, in particular for depth prediction task.


Summary

  • The paper presents an innovative self-supervised framework that integrates CroCo pretraining with finetuning to accurately learn 3D geometry.
  • The approach employs a transformer architecture enhanced with adapters and Dense Prediction Transformer, using photometric, geometric consistency, and edge-aware smoothness losses.
  • Experimental evaluations on six diverse benchmarks demonstrate that the model outperforms state-of-the-art methods, setting new accuracy records on KITTI and NYUv2.

Self-supervised Pretraining and Finetuning for Monocular Depth and Visual Odometry

In the paper "Self-supervised Pretraining and Finetuning for Monocular Depth and Visual Odometry" by Boris Chidlovskii and Leonid Antsfeld from Naver Labs Europe, the authors present a novel approach to simultaneously address monocular depth estimation and visual odometry (DVO), leveraging self-supervised learning techniques. The core contribution of the paper lies in integrating a self-supervised pretraining step followed by a fine-tuning step, both of which do not require annotated data.

Methodology

The proposed approach is divided into two primary steps:

  1. Self-supervised Pretraining with CroCo: The authors employ a vision-transformer backbone pretrained with the Cross-View Completion (CroCo) objective. In this phase, the model learns 3D geometry by reconstructing a partially masked input image with the help of a second view of the same scene. The task, closely related to masked image modeling (MIM), trains the model to capture geometric structure from large-scale, heterogeneous datasets.
  2. Self-supervised Finetuning: The model, pre-trained with CroCo, is fine-tuned on non-annotated videos. During this stage, two tasks are addressed: depth estimation and visual odometry, using a shared encoder and task-specific decoders. Self-supervised losses, including photometric loss, geometric consistency loss, and edge-aware smoothness loss, guide the fine-tuning process.
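The finetuning losses above can be sketched in simplified form. The snippet below is a minimal NumPy illustration, not the authors' implementation: it shows an L1 photometric term and the edge-aware smoothness term that down-weights depth gradients at image edges; the SSIM component of the photometric loss and the geometric consistency term are omitted for brevity.

```python
import numpy as np

def photometric_l1(target, warped):
    """L1 photometric term between the target frame and the source frame
    warped into the target view (SSIM component omitted for brevity)."""
    return np.mean(np.abs(target - warped))

def edge_aware_smoothness(depth, image):
    """Penalize depth gradients except where the image itself has edges.

    depth: (H, W) predicted depth map
    image: (H, W) grayscale image (average channels first for color input)
    """
    d = depth / (depth.mean() + 1e-7)      # mean-normalize for scale invariance
    dx_d = np.abs(d[:, 1:] - d[:, :-1])    # horizontal depth gradients
    dy_d = np.abs(d[1:, :] - d[:-1, :])    # vertical depth gradients
    dx_i = np.abs(image[:, 1:] - image[:, :-1])
    dy_i = np.abs(image[1:, :] - image[:-1, :])
    # exp(-|image gradient|) lets depth discontinuities survive at image edges
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()
```

In the full method these terms are combined with the geometric consistency loss under relative weights, which are hyperparameters of the training recipe.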

Architectural Innovations

The architecture integrates standard transformer blocks but emphasizes additional elements to enhance performance:

  • Adapters: Small trainable layers inserted into the frozen backbone; only the adapters are updated during finetuning, which substantially reduces its computational cost.
  • Dense Prediction Transformer (DPT): A decoder head that reassembles tokens from multiple transformer layers and fuses them through convolutional upsampling, specifically targeting dense prediction tasks.
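A bottleneck adapter of the kind described above can be sketched as follows. This is a hypothetical NumPy illustration; the layer sizes, initialization, and exact placement inside the transformer block are assumptions, not the paper's configuration.

```python
import numpy as np

class Adapter:
    """Bottleneck adapter: a small residual MLP attached to a frozen
    transformer block. Only W_down and W_up would receive gradients."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        self.W_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        # Zero-init the up-projection so the adapter starts as the identity
        # and does not perturb the pretrained backbone at step 0.
        self.W_up = np.zeros((bottleneck, dim))

    def __call__(self, x):
        h = np.maximum(x @ self.W_down, 0.0)  # ReLU in the bottleneck
        return x + h @ self.W_up              # residual connection
```

Because only the two small projection matrices are trained while the backbone stays frozen, the number of updated parameters, and hence the finetuning cost, drops sharply compared with full finetuning.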

Experimental Evaluation

The method was rigorously evaluated on six benchmark datasets covering diverse conditions such as indoor, outdoor, static, dynamic, synthetic, and real environments:

  • KITTI: An outdoor urban driving dataset.
  • NYUv2: A popular indoor scene dataset.
  • DDAD: Another outdoor urban driving dataset.
  • Bonn: Dynamic indoor scenes.
  • TUM: Dynamic indoor videos.
  • Gibson: Primarily used for navigation tasks with synthetic indoor scenes.

The results demonstrated that the proposed model consistently outperforms existing state-of-the-art methods across multiple metrics. For instance, the CroCo-DVO model achieved an AbsRel error of 0.098 on KITTI and 0.095 on NYUv2, setting new benchmarks for self-supervised depth estimation in these datasets.
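For reference, AbsRel (absolute relative error) is the mean of |d_pred − d_gt| / d_gt over pixels with valid ground truth. A minimal sketch, not tied to any particular evaluation toolkit:

```python
import numpy as np

def abs_rel(pred, gt, eps=1e-7):
    """Absolute relative depth error over pixels with valid ground truth."""
    mask = gt > eps                        # ignore missing/zero-depth pixels
    diff = np.abs(pred[mask] - gt[mask])
    return float(np.mean(diff / gt[mask]))
```

Lower is better: an AbsRel of 0.098 means predicted depths deviate from ground truth by about 9.8% on average.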

Implications and Future Directions

The research underscores the efficacy of combining transformer architectures with self-supervised learning for geometric vision tasks. By eliminating the need for annotated data, this multifaceted approach is particularly well-suited for scenarios where data labeling is impossible or impractical. The methodologies proposed extend beyond monocular depth and visual odometry and could be adapted to other vision-based tasks such as optical flow and stereo matching.

Future research may explore further optimizations of the architecture, potentially incorporating additional self-supervised learning paradigms or extending this approach to more complex scenes and tasks within autonomous navigation and obstacle avoidance frameworks. The robustness and scalability demonstrated by CroCo-DVO indicate promising directions for unified methodologies in computer vision and AI, particularly enhancing performance on under-explored and data-scarce domains.

The advancements shown in this paper pave the way for more generalized models capable of understanding and interacting with the world in a self-supervised manner, thereby pushing the boundaries of what is achievable in AI-driven visual perception systems.
