- The paper introduces a multi-task network that outperforms single-task approaches by sharing encoder features and synergizing task-specific decoders.
- It employs a novel camera geometry tensor to address fisheye distortion, enabling accurate depth estimation, semantic segmentation, and object detection on raw surround-view images.
- Experimental results on datasets like KITTI and WoodScape demonstrate OmniDet’s efficiency and robustness for diverse autonomous driving perception tasks.
Overview of "OmniDet: Surround View Cameras Based Multi-task Visual Perception Network for Autonomous Driving"
This paper introduces "OmniDet," a comprehensive multi-task visual perception network designed to process unrectified fisheye images from the surround-view cameras used on autonomous vehicles. The authors argue for a unified perception model that performs several tasks efficiently and accurately within a single network architecture, targeting six tasks essential for autonomous driving: depth estimation, visual odometry, semantic segmentation, motion segmentation, object detection, and lens soiling detection.
Key Contributions
1. Multi-task Learning Efficiency:
The authors present a multi-task network that outperforms equivalent single-task networks on these perception tasks. The design pairs a shared encoder with task-specific decoders, so features are computed once and reused across tasks, which reduces computation and fosters inter-task synergies; a minimal sketch of this layout follows.
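For concreteness, here is a minimal PyTorch sketch of the shared-encoder / task-specific-decoder layout. The encoder depth, head shapes, and task names are illustrative placeholders, not OmniDet's actual architecture.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Illustrative shared-encoder / per-task-decoder layout (not OmniDet's exact design)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Shared encoder: features are computed once and reused by every decoder.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Task-specific decoders branch off the shared features.
        self.depth_head = nn.Conv2d(128, 1, 3, padding=1)          # per-pixel distance
        self.seg_head = nn.Conv2d(128, num_classes, 3, padding=1)  # semantic logits
        self.motion_head = nn.Conv2d(128, 2, 3, padding=1)         # static vs. moving

    def forward(self, x):
        feats = self.encoder(x)  # one forward pass through the shared encoder
        return {
            "depth": torch.sigmoid(self.depth_head(feats)),
            "semantic": self.seg_head(feats),
            "motion": self.motion_head(feats),
        }

outputs = MultiTaskNet()(torch.randn(1, 3, 256, 512))
print({k: tuple(v.shape) for k, v in outputs.items()})
```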
2. Camera Geometry Tensor (CGT):
To address fisheye distortion, the paper introduces a novel camera geometry tensor that encodes the camera's radial distortion characteristics as an input to the network during both training and inference. This conditioning lets a single model generalize across varying camera intrinsics and viewing angles, demonstrated on the WoodScape dataset of diverse fisheye camera data; a hedged sketch of the idea appears below.
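The paper's exact tensor contents depend on its calibration model, so the following is only a CoordConv-style approximation: per-pixel geometry channels (normalized coordinates, radius, and a polynomial distortion of the incidence angle) are concatenated to the input. The coefficients `k` are dummy values, not WoodScape calibration parameters.

```python
import torch

def camera_geometry_tensor(h, w, k=(1.0, 0.02, -0.01, 0.001)):
    """CoordConv-style stand-in for a camera geometry tensor.

    Builds per-pixel channels that can be concatenated to image features so
    the network is conditioned on the lens geometry. The polynomial `k` is a
    placeholder for real fisheye calibration coefficients.
    """
    ys = torch.linspace(-1.0, 1.0, h)
    xs = torch.linspace(-1.0, 1.0, w)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    r = torch.sqrt(xx**2 + yy**2)                            # normalized radius per pixel
    theta = torch.atan(r)                                    # incidence-angle proxy
    r_d = sum(c * theta**(i + 1) for i, c in enumerate(k))   # polynomial distortion model
    return torch.stack([xx, yy, r, r_d], dim=0)              # (4, H, W) geometry channels

cgt = camera_geometry_tensor(64, 128)
x = torch.randn(1, 3, 64, 128)
x_aug = torch.cat([x, cgt.unsqueeze(0)], dim=1)  # geometry-aware network input
print(x_aug.shape)  # torch.Size([1, 7, 64, 128])
```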
3. Advanced Object Detection:
OmniDet employs a generalized object detection approach in which objects are represented by a polygon with non-uniformly sampled vertices, accounting for the radial distortion of fisheye lenses. This replaces traditional axis-aligned bounding boxes, which fit distorted objects poorly, with a more accurate polygonal outline; a toy encoding is sketched below.
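As a toy illustration of such a polar polygon encoding, the sketch below converts a binary instance mask into a fixed number of (angle, radius) vertices around the centroid. The bin count and the max-radius rule are assumptions for illustration, not the paper's exact regression target.

```python
import numpy as np

def mask_to_polar_polygon(mask: np.ndarray, num_vertices: int = 24):
    """Convert a binary instance mask to a fixed-size polar polygon.

    Hypothetical encoding: for each angular bin around the centroid, keep the
    farthest foreground pixel, yielding `num_vertices` per-bin radii.
    """
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()            # object centroid
    angles = np.arctan2(ys - cy, xs - cx)    # angle of every mask pixel
    radii = np.hypot(ys - cy, xs - cx)       # distance from centroid
    bins = ((angles + np.pi) / (2 * np.pi) * num_vertices).astype(int) % num_vertices
    poly = np.zeros(num_vertices)
    for b in range(num_vertices):
        in_bin = bins == b
        if in_bin.any():
            poly[b] = radii[in_bin].max()    # outermost point in this angular bin
    return (cy, cx), poly                    # center plus per-bin radius

mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 15:50] = True                    # toy rectangular instance
center, poly = mask_to_polar_polygon(mask)
print(center, poly.round(1))
```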
4. Task Synergy and Novel Techniques:
The paper proposes synergized task decoders: semantic and motion segmentation outputs are used to filter out dynamic objects that would otherwise corrupt the self-supervised distance estimation. Additionally, a VarNorm technique for task loss weighting normalizes each task loss by its variance, improving the training stability of the multi-task network; one possible formulation is sketched below.
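The exact VarNorm formulation is not reproduced here, so the sketch below shows one plausible reading: each task loss is divided by a running estimate of its standard deviation so all tasks contribute at a comparable scale. The class name, momentum value, and update rule are assumptions, not the paper's definition.

```python
import torch

class VarNormWeighting:
    """Hedged sketch of variance-based task loss weighting (one reading of VarNorm)."""
    def __init__(self, tasks, momentum=0.99, eps=1e-8):
        self.momentum, self.eps = momentum, eps
        self.mean = {t: 0.0 for t in tasks}
        self.var = {t: 1.0 for t in tasks}

    def __call__(self, losses: dict) -> torch.Tensor:
        total = 0.0
        for task, loss in losses.items():
            value = loss.detach().item()
            # Exponential moving estimates of per-task loss statistics.
            self.mean[task] = self.momentum * self.mean[task] + (1 - self.momentum) * value
            self.var[task] = self.momentum * self.var[task] + \
                (1 - self.momentum) * (value - self.mean[task]) ** 2
            # Rescale each loss so all tasks contribute at a comparable magnitude.
            total = total + loss / (self.var[task] ** 0.5 + self.eps)
        return total

weigh = VarNormWeighting(["depth", "semantic", "motion"])
losses = {"depth": torch.tensor(0.8, requires_grad=True),
          "semantic": torch.tensor(2.3, requires_grad=True),
          "motion": torch.tensor(0.1, requires_grad=True)}
weigh(losses).backward()
```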
Numerical Results and Evaluation
The paper reports state-of-the-art performance on the KITTI dataset for depth and pose estimation, with competitive results for the other tasks on datasets such as Cityscapes. Extensive evaluations include ablation studies on architecture choices and loss weighting strategies, demonstrating the robustness of the proposed model. The integration of self-attention modules and novel loss functions further improves performance across the diverse tasks.
Implications and Future Directions
The implications of this research are manifold, impacting the design of perception systems in autonomous vehicles. The OmniDet framework sets a new precedent in handling fisheye camera distortions and multi-task learning for automated driving, enhancing both accuracy and computational efficiency. Future research may focus on extending this framework to encompass additional perception tasks, improving robustness to adversarial conditions, and further optimizing real-time performance for deployment in commercial vehicles.
Additionally, the adaptability of the CGT to diverse camera systems and dynamic environments opens avenues for broader applications beyond autonomous driving, such as robotics and surveillance, where omnidirectional vision systems are prevalent. The integration of this adaptive mechanism within neural network architectures represents a significant step forward toward flexible, hardware-agnostic perception models.