An Improved RaftStereo Trained with A Mixed Dataset for the Robust Vision Challenge 2022 (2210.12785v1)

Published 23 Oct 2022 in cs.CV

Abstract: Stereo-matching is a fundamental problem in computer vision. Despite recent progress by deep learning, improving the robustness is ineluctable when deploying stereo-matching models to real-world applications. Different from the common practices, i.e., developing an elaborate model to achieve robustness, we argue that collecting multiple available datasets for training is a cheaper way to increase generalization ability. Specifically, this report presents an improved RaftStereo trained with a mixed dataset of seven public datasets for the robust vision challenge (denoted as iRaftStereo_RVC). When evaluated on the training sets of Middlebury, KITTI-2015, and ETH3D, the model outperforms its counterparts trained with only one dataset, such as the popular Sceneflow. After fine-tuning the pre-trained model on the three datasets of the challenge, it ranks at 2nd place on the stereo leaderboard, demonstrating the benefits of mixed dataset pre-training.

Citations (6)

Summary

  • The paper introduces a novel mixed dataset training approach that improves the robustness and generalization of RaftStereo for stereo matching.
  • The method employs a two-phase training with pre-training on seven diverse datasets and fine-tuning on individual challenge benchmarks to yield enhanced performance.
  • Evaluations on KITTI-2015, Middlebury, and ETH3D demonstrate improved foreground accuracy and a balanced foreground/background error ratio, securing second place in the Robust Vision Challenge 2022 stereo track.

An Improved RaftStereo for Stereo Matching

The paper "An Improved RaftStereo Trained with a Mixed Dataset for the Robust Vision Challenge 2022" presents a compelling approach to enhancing the robustness and generalization of stereo matching models in diverse environments. The paper departs from conventional strategies focused on model complexity, advocating instead for the utilization of mixed dataset training to achieve robustness and improve disparity estimation performance across multiple datasets.

Background and Methodology

Stereo matching, a fundamental task in computer vision, underpins depth recovery and supports applications such as 3D reconstruction and robotic navigation. The evolution from traditional methods such as SGM and ELAS to deep learning-based approaches has significantly advanced stereo matching capabilities. More recently, RAFT, an iterative-refinement framework originally developed for optical flow, has been adapted to stereo in the form of RaftStereo. These advances facilitate the construction of robust vision systems capable of operating in varied environments.

The paper introduces iRaftStereo_RVC, a model pre-trained on a mix of seven public datasets, including Sceneflow, CreStereo, and TartanAir. This strategy is motivated by prior work showing that mixed-dataset training improves cross-dataset generalization. Pre-training RaftStereo on the mixed dataset yields better performance than training on a single dataset such as Sceneflow.

Technical Implementation

RaftStereo constitutes the base model for this paper, leveraging the RAFT framework's iterative disparity field refinement. The authors implemented a standard RaftStereo setup utilizing multi-level convolutional GRUs and feature extraction at 1/4 resolution.
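To make the iterative refinement concrete, the following is a minimal PyTorch-style sketch of a RAFT-like update: a hidden state maintained by a convolutional GRU emits a residual disparity at each step. It is a simplification, not the authors' implementation; the correlation-volume lookup and RaftStereo's multi-level GRU hierarchy are omitted, and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class IterativeDisparityRefiner(nn.Module):
    """Illustrative RAFT-style refinement: a single-level conv GRU updates a
    hidden state and predicts a residual disparity at every iteration.
    (RaftStereo additionally uses correlation lookups and multi-level GRUs.)"""

    def __init__(self, hidden_dim=128, context_dim=128):
        super().__init__()
        in_dim = context_dim + 1  # context features + current disparity map
        self.gru_z = nn.Conv2d(hidden_dim + in_dim, hidden_dim, 3, padding=1)
        self.gru_r = nn.Conv2d(hidden_dim + in_dim, hidden_dim, 3, padding=1)
        self.gru_q = nn.Conv2d(hidden_dim + in_dim, hidden_dim, 3, padding=1)
        self.delta_head = nn.Conv2d(hidden_dim, 1, 3, padding=1)

    def forward(self, context, disparity, hidden, iters=12):
        predictions = []
        for _ in range(iters):
            x = torch.cat([context, disparity], dim=1)
            hx = torch.cat([hidden, x], dim=1)
            z = torch.sigmoid(self.gru_z(hx))                      # update gate
            r = torch.sigmoid(self.gru_r(hx))                      # reset gate
            q = torch.tanh(self.gru_q(torch.cat([r * hidden, x], dim=1)))
            hidden = (1 - z) * hidden + z * q
            disparity = disparity + self.delta_head(hidden)        # residual update
            predictions.append(disparity)
        return predictions

# Features live at 1/4 of the input resolution; a small crop keeps the demo fast.
refiner = IterativeDisparityRefiner()
ctx = torch.randn(1, 128, 48, 96)
disp = torch.zeros(1, 1, 48, 96)
h = torch.zeros(1, 128, 48, 96)
preds = refiner(ctx, disp, h, iters=4)
print(len(preds), preds[-1].shape)
```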

The mixed dataset introduced in this paper blends synthetic and real-world data, including HR-VS and InStereo2K, to cover a broad range of scene complexities. To balance the representation of datasets of very different sizes, some datasets are repeated with different frequencies in the mix. The model follows a two-phase training schedule: pre-training on the mixed dataset, followed by fine-tuning on the individual challenge datasets (KITTI-2015, Middlebury, ETH3D).
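A minimal sketch of this repetition-based balancing is shown below. The repetition factors and the toy datasets are placeholders chosen for illustration, not the values used in the paper, and only datasets named in this summary are listed.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Hypothetical repetition factors: smaller datasets are repeated more often so
# that the mix is not dominated by the largest dataset. Values are placeholders.
repeat_factors = {
    "Sceneflow": 1,
    "CreStereo": 1,
    "TartanAir": 1,
    "HR-VS": 20,
    "InStereo2K": 5,
}

def build_mixed_dataset(datasets, repeat_factors):
    """Concatenate datasets, repeating each one according to its factor."""
    parts = []
    for name, ds in datasets.items():
        parts.extend([ds] * repeat_factors.get(name, 1))
    return ConcatDataset(parts)

# Toy stand-ins for real stereo samples: (left image, right image, disparity).
datasets = {
    name: TensorDataset(torch.randn(8, 3, 64, 64),
                        torch.randn(8, 3, 64, 64),
                        torch.randn(8, 1, 64, 64))
    for name in repeat_factors
}

mixed = build_mixed_dataset(datasets, repeat_factors)
loader = DataLoader(mixed, batch_size=4, shuffle=True)
print(len(mixed))  # total samples after repetition
```

In the paper's schedule, a mix of this kind is used only for pre-training; the resulting weights are then fine-tuned separately on each challenge dataset.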

Results and Evaluation

The pre-trained iRaftStereo_RVC demonstrated superior zero-shot generalization across the KITTI-2015, Middlebury, and ETH3D datasets compared to models pre-trained on single datasets. Significantly, it ranked second in the Robust Vision Challenge 2022 stereo track, underscoring the utility of mixed dataset training.

A detailed examination of the KITTI-2015 benchmark results shows that iRaftStereo_RVC performs especially well on foreground regions, which matter most in applications where object-level depth is paramount. This is reflected in a low ratio between the foreground (D1-fg) and background (D1-bg) error rates, indicating balanced performance across the two region types.
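For readers unfamiliar with the KITTI-2015 protocol, the foreground/background comparison can be computed as sketched below. The D1 outlier definition is standard for KITTI; the arrays and the foreground mask here are synthetic stand-ins, not results from the paper.

```python
import numpy as np

def d1_error(pred, gt, valid, abs_thresh=3.0, rel_thresh=0.05):
    """KITTI-style D1 outlier rate: a pixel is an outlier if its disparity
    error exceeds both 3 px and 5% of the ground-truth disparity."""
    err = np.abs(pred - gt)
    bad = (err > abs_thresh) & (err > rel_thresh * gt)
    return np.mean(bad[valid])

# Synthetic stand-ins; in practice these come from the KITTI-2015 devkit,
# with fg_mask marking pixels belonging to foreground objects (vehicles).
pred = np.random.rand(375, 1242) * 100
gt = np.random.rand(375, 1242) * 100
valid = gt > 0
fg_mask = np.zeros_like(valid)
fg_mask[150:250, 400:800] = True

d1_fg = d1_error(pred, gt, valid & fg_mask)
d1_bg = d1_error(pred, gt, valid & ~fg_mask)
print(f"D1-fg={d1_fg:.3f}  D1-bg={d1_bg:.3f}  fg/bg ratio={d1_fg / d1_bg:.2f}")
```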

Implications and Future Directions

The findings of this paper hold practical implications for the deployment of stereo matching systems in unpredictable real-world settings. By incorporating a wide variety of datasets during pre-training, models can better adapt without sacrificing efficiency or requiring complex architectures.

Looking forward, this mixed dataset approach can be expanded further to optimize other vision tasks, potentially including components like temporal coherence in video sequences or integrating domain adaptation techniques. The ongoing development and availability of diverse datasets will continue to enhance the potential of such methodologies in improving vision system robustness.

In summary, this paper contributes valuable insights into the potential of mixed dataset pre-training for enhancing the performance and generalization of stereo matching systems, offering promising avenues for both academic research and industrial applications.
