A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation
Key Contributions
- The paper introduces three synthetic datasets (FlyingThings3D, Monkaa, Driving) that provide extensive training data for disparity, optical flow, and scene flow estimation.
- It presents novel ConvNet architectures, DispNet and DispNetCorr1D, that use an encoder-decoder design, with DispNetCorr1D adding an explicit 1D correlation layer, for improved performance.
- It demonstrates that training on synthetic data yields state-of-the-art, real-time results, with significant implications for autonomous driving and ADAS applications.
Introduction
This paper addresses the significant gap in datasets available for training convolutional networks (ConvNets) to estimate scene flow, i.e., the depth and 3D motion vector of every visible point in a stereo video. Existing datasets are too small and too limited in diversity to train large ConvNets effectively. The proposed solution is a suite of three synthetic datasets generated with Blender, providing substantial data for disparity, optical flow, and scene flow estimation.
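To make the estimation target concrete, the following minimal sketch (not code from the paper) shows how a 3D point and its scene flow vector can be recovered from the quantities such datasets annotate: disparity, optical flow, and disparity change. It assumes a rectified stereo rig; the camera parameters `f`, `B`, `cx`, `cy` and all function names are illustrative.

```python
import numpy as np

def backproject(x, y, disparity, f, B, cx, cy):
    """Back-project pixel (x, y) with a given disparity to a 3D point."""
    Z = f * B / disparity            # depth from the standard stereo relation
    X = (x - cx) * Z / f
    Y = (y - cy) * Z / f
    return np.array([X, Y, Z])

def scene_flow_vector(x, y, disp, flow, disp_change, f, B, cx, cy):
    """3D motion of the point seen at (x, y): position at t+1 minus position at t."""
    p_t = backproject(x, y, disp, f, B, cx, cy)
    u, v = flow                      # optical flow moves the pixel to frame t+1
    p_t1 = backproject(x + u, y + v, disp + disp_change, f, B, cx, cy)
    return p_t1 - p_t                # the scene flow vector
```

In this formulation, estimating disparity, optical flow, and disparity change per pixel is equivalent to estimating dense scene flow, which is why the datasets annotate exactly these quantities.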
Dataset Overview
The new dataset collection includes three distinct subsets: FlyingThings3D, Monkaa, and Driving. Together, these datasets encompass various scenes with over 35,000 stereo frames, offering dense ground truth for disparity, bidirectional optical flow, and disparity change, in addition to object segmentation and camera calibration data.
- FlyingThings3D: This dataset includes everyday objects animated along randomized 3D trajectories. It consists of approximately 25,000 stereo frames, and it is specifically designed to facilitate the training of large networks by providing substantial diverse data.
- Monkaa: Derived from the open-source animated short film "Monkaa," this subset contains complex scenes with non-rigid, articulated motion, giving it characteristics similar to the MPI Sintel dataset.
- Driving: Emulating the KITTI dataset setting, this subset features realistic driving scenes with dynamic elements like moving vehicles and changing lighting conditions.
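The released ground truth for such datasets is commonly stored in the PFM floating-point image format (single-channel "Pf" for disparity, three-channel "PF" for flow-like data). Below is a minimal reader sketch under that assumption; file layout details follow the standard PFM specification, not code from the paper.

```python
import numpy as np

def read_pfm(path):
    """Read a PFM file into a numpy array (H, W) or (H, W, 3)."""
    with open(path, "rb") as f:
        header = f.readline().decode().rstrip()        # "PF" (color) or "Pf" (gray)
        channels = 3 if header == "PF" else 1
        width, height = map(int, f.readline().decode().split())
        scale = float(f.readline().decode().rstrip())  # sign encodes endianness
        endian = "<" if scale < 0 else ">"
        data = np.fromfile(f, dtype=endian + "f4", count=width * height * channels)
    if channels == 3:
        data = data.reshape(height, width, 3)
    else:
        data = data.reshape(height, width)
    return np.flipud(data)                             # PFM stores rows bottom-to-top
```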
Network Architecture and Training
The paper introduces two ConvNet architectures, DispNet and DispNetCorr1D, aimed at core components of the scene flow estimation problem. The networks employ an encoder-decoder design with skip connections between the contracting and expanding parts, processing stereo image pairs to estimate disparity and optical flow.
- DispNet takes the two RGB images of a stereo pair stacked along the channel dimension and uses convolutional and up-convolutional layers to generate dense disparity maps efficiently.
- DispNetCorr1D adds a correlation layer that explicitly matches features between the two views along horizontal scanlines, exploiting the epipolar constraint of rectified stereo and improving the network's ability to handle large disparities (a minimal sketch of such a layer follows this list).
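The sketch below illustrates the idea of a 1D correlation layer in PyTorch: for each candidate disparity, the left feature map is compared against a horizontally shifted right feature map via a per-pixel dot product. This is an illustrative reimplementation of the concept, not the paper's code; `max_disp` and all names are assumptions.

```python
import torch
import torch.nn.functional as F

def correlation_1d(feat_left, feat_right, max_disp=40):
    """feat_*: (N, C, H, W) feature maps. Returns (N, max_disp + 1, H, W)."""
    N, C, H, W = feat_left.shape
    # Pad the right features on the left so every shift stays in-bounds.
    padded = F.pad(feat_right, (max_disp, 0))
    costs = []
    for d in range(max_disp + 1):
        # Right-view features shifted by disparity d (matches lie to the left).
        shifted = padded[:, :, :, max_disp - d : max_disp - d + W]
        # Per-pixel dot product over channels, normalized by channel count.
        costs.append((feat_left * shifted).mean(dim=1))
    return torch.stack(costs, dim=1)
```

Restricting the search to horizontal shifts (rather than a 2D neighborhood, as in flow estimation) keeps the output compact, so a large disparity range can be covered cheaply.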
Numerical Results and Evaluation
The paper presents a comprehensive evaluation comparing the performance of the proposed networks to existing methods such as SGM and MC-CNN. Notable findings include:
- DispNetCorr1D achieved state-of-the-art performance on the KITTI 2015 benchmark, with especially large improvements in error rates for foreground (moving-object) pixels; the benchmark's D1 error metric is sketched after this list.
- Pretraining on the new datasets followed by fine-tuning on KITTI enabled the networks to deliver competitive results in real time (around 15 frames per second on high-resolution stereo images).
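For reference, the KITTI 2015 benchmark's published D1 metric counts a pixel as erroneous when its disparity error exceeds both 3 px and 5% of the ground-truth value. The sketch below reproduces that definition; it is not code from the paper.

```python
import numpy as np

def d1_error(pred, gt, valid=None):
    """pred, gt: (H, W) disparity maps; valid: optional boolean mask."""
    if valid is None:
        valid = gt > 0                        # KITTI marks invalid pixels as 0
    err = np.abs(pred - gt)
    bad = (err > 3.0) & (err > 0.05 * gt)     # both thresholds must be exceeded
    return 100.0 * np.mean(bad[valid])        # percentage of erroneous pixels
```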
These evaluations underscore the effectiveness of the proposed datasets and networks in addressing the limitations of earlier approaches.
Practical and Theoretical Implications
The synthetic datasets presented in this paper facilitate the training of large-scale ConvNets, providing a valuable resource for advancing research in scene flow estimation. The joint estimation of disparity and optical flow from stereo images is demonstrated to be both computationally efficient and accurate, with potential applications in autonomous driving and advanced driver assistance systems (ADAS).
The proposed methods and results demonstrate promising directions for future development in AI and computer vision:
- End-to-End Training: The success of end-to-end trained networks such as DispNet and DispNetCorr1D emphasizes the potential for developing more complex architectures that can jointly estimate multiple features from stereo pairs.
- Real-time Performance: The ability to achieve real-time performance suggests practical applications in real-world systems, especially within autonomous vehicle navigation systems where timely and accurate scene flow estimation is crucial.
- Synthetic Data Utility: The use of synthetic datasets challenges the conventional reliance on real-world data, highlighting the scalability and labeling advantages of synthetic data for training deep learning models in vision tasks.
Future Work
Building on the current work, future research could explore several promising avenues:
- Extended Dataset Generation: Further expanding the synthetic datasets to include more diverse and realistic scenarios will bolster the training of even more robust networks.
- Joint Optimization: Developing and experimenting with new architectures for joint optimization of disparity, optical flow, and scene flow to improve accuracy and reduce computational overhead.
- Transfer Learning: Investigating transfer learning approaches that fine-tune synthetic-data-trained models on limited real-world data to enhance performance in practical applications; a minimal sketch of this recipe follows below.
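A hedged sketch of the synthetic-to-real recipe the paper's results suggest: train on synthetic data, then fine-tune on KITTI with a reduced learning rate so the learned features adapt without being destroyed. The tiny stand-in model, data tensors, and hyperparameters here are purely illustrative.

```python
import torch
import torch.nn as nn

class TinyDispNet(nn.Module):
    """Stand-in for a DispNet-style network; the real architecture is far deeper."""
    def __init__(self):
        super().__init__()
        # Stacked stereo pair (6 channels) in, one disparity channel out.
        self.conv = nn.Conv2d(6, 1, kernel_size=3, padding=1)

    def forward(self, left, right):
        return self.conv(torch.cat([left, right], dim=1))

model = TinyDispNet()
# In practice one would load synthetic-data weights here, e.g.:
# model.load_state_dict(torch.load("flyingthings3d_pretrained.pth"))  # illustrative path

# Fine-tune with a smaller learning rate than was used for pretraining.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# One fine-tuning step on a dummy (left, right, ground-truth disparity) batch.
left = torch.rand(1, 3, 64, 128)
right = torch.rand(1, 3, 64, 128)
gt = torch.rand(1, 1, 64, 128)
loss = torch.nn.functional.l1_loss(model(left, right), gt)  # endpoint-style error
optimizer.zero_grad()
loss.backward()
optimizer.step()
```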
In conclusion, this paper introduces valuable resources and methodologies that significantly advance the capabilities of ConvNets in disparity, optical flow, and scene flow estimation. The synthetic datasets and proposed network architectures hold substantial promise for both academic research and practical implementations in various fields of computer vision and autonomous systems.