- The paper introduces a novel end-to-end multi-task neural network (YOLOP) that achieves real-time traffic object detection, drivable area segmentation, and lane detection on embedded devices.
- The network leverages a shared CSPDarknet-based encoder and specialized decoders with efficient upsampling to boost inference speed and accuracy.
- YOLOP demonstrates competitive metrics on the BDD100K dataset, achieving 76.5% mAP50 for detection, 91.5% mIoU for drivable area segmentation, and 26.2% IoU for lane detection.
YOLOP: You Only Look Once for Panoptic Driving Perception (2021) introduces a novel multi-task neural network designed for real-time autonomous driving perception. The paper addresses the need to perform traffic object detection, drivable area segmentation, and lane detection simultaneously, with high accuracy and speed, particularly on computationally limited embedded devices.
The core idea is to build an efficient, end-to-end trainable network called YOLOP. This network comprises a single shared encoder for feature extraction and three distinct decoders, each specializing in one of the three perception tasks. This architecture allows for shared computation and potentially beneficial information flow between related tasks, leading to reduced inference time compared to processing each task sequentially with separate models.
Network Architecture
- Encoder: The encoder extracts rich features from the input image. It uses CSPDarknet [cspdarknet] as the backbone, chosen for its efficiency and its mitigation of redundant gradient information. The neck combines Spatial Pyramid Pooling (SPP) [sppnet] with a Feature Pyramid Network (FPN) [fpn] to fuse features of different scales and semantic levels, and a bottom-up Path Aggregation Network (PAN) [pannet] further aggregates localization features ahead of the detection head, providing robust feature maps for the decoders.
- Decoders:
- Detect Head: This decoder is based on an anchor-based multi-scale detection approach, similar to YOLOv4 [yolov4]. It takes multi-scale features from the PAN and predicts bounding box locations, dimensions, object confidence, and class probabilities for traffic objects.
- Drivable Area Segment Head: This head performs pixel-wise semantic segmentation to identify the drivable regions. It takes features from the lower layers of the FPN and uses a simple structure of upsampling layers with nearest-neighbor interpolation (chosen for computational efficiency) to produce a full-resolution segmentation mask (drivable area vs. background).
- Lane Line Segment Head: Similar in structure to the drivable area head, this decoder predicts the location of lane lines pixel by pixel. It also uses FPN features and efficient upsampling to output a full-resolution segmentation mask (lane line vs. background). A minimal code sketch of the shared encoder and three decoders follows this list.
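Everything in the sketch below (class names, channel widths, and the stub encoder and detection head) is illustrative rather than the paper's implementation; it only mirrors the shape of the design: one shared feature extractor, an anchor-style detection head on multi-scale features, and two lightweight segmentation heads that restore full resolution with nearest-neighbor interpolation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StubEncoder(nn.Module):
    """Stand-in for the CSPDarknet backbone + SPP/FPN neck. Returns a stride-8
    feature map (shared by both segmentation decoders) and a list of
    multi-scale maps for the detection head. Channel widths are made up."""

    def __init__(self, ch=256):
        super().__init__()
        self.stem = nn.Conv2d(3, ch, 3, stride=8, padding=1)      # crude stride-8 features
        self.down16 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)   # stride-16
        self.down32 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)   # stride-32

    def forward(self, x):
        p8 = self.stem(x)
        p16 = self.down16(p8)
        p32 = self.down32(p16)
        return {"fpn_low": p8, "pan": [p8, p16, p32]}


class SegHead(nn.Module):
    """Lightweight segmentation decoder: a few 3x3 convs interleaved with
    nearest-neighbor upsampling back to the input resolution (2 classes)."""

    def __init__(self, in_ch=256, num_classes=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, 128, 3, padding=1)
        self.conv2 = nn.Conv2d(128, 64, 3, padding=1)
        self.out = nn.Conv2d(64, num_classes, 1)

    def forward(self, x, out_size):
        x = F.relu(self.conv1(x))
        x = F.interpolate(x, scale_factor=2, mode="nearest")   # cheap upsampling
        x = F.relu(self.conv2(x))
        x = F.interpolate(x, size=out_size, mode="nearest")    # back to full resolution
        return self.out(x)


class DetectStub(nn.Module):
    """Stand-in anchor-based head: one 1x1 conv per scale predicting
    num_anchors * (x, y, w, h, objectness, classes) channels."""

    def __init__(self, in_ch=256, num_anchors=3, num_classes=1):
        super().__init__()
        out_ch = num_anchors * (5 + num_classes)
        self.preds = nn.ModuleList(nn.Conv2d(in_ch, out_ch, 1) for _ in range(3))

    def forward(self, feats):
        return [pred(f) for pred, f in zip(self.preds, feats)]


class YOLOPSketch(nn.Module):
    """One shared encoder feeding three task-specific decoders."""

    def __init__(self):
        super().__init__()
        self.encoder = StubEncoder()
        self.detect_head = DetectStub()
        self.da_head = SegHead()   # drivable area vs. background
        self.ll_head = SegHead()   # lane line vs. background

    def forward(self, img):
        feats = self.encoder(img)
        h, w = img.shape[-2:]
        det = self.detect_head(feats["pan"])
        da = self.da_head(feats["fpn_low"], (h, w))
        ll = self.ll_head(feats["fpn_low"], (h, w))
        return det, da, ll


model = YOLOPSketch()
det, da, ll = model(torch.randn(1, 3, 384, 640))   # 640x384 input as in the paper
print([d.shape for d in det], da.shape, ll.shape)
```

Swapping the stubs for a real CSPDarknet backbone and a YOLOv4-style head recovers the actual architecture; the point here is the single encoder shared by all three decoders.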
Loss Function
The network is trained end-to-end using a combined loss function L_all, which is a weighted sum of the losses for each task: L_det, L_da-seg, and L_ll-seg.
- The detection loss (L_det) is a combination of classification loss (using Focal Loss [focalloss]), objectness loss (using Focal Loss), and bounding box regression loss (using CIoU Loss [d/ciouloss]).
- The segmentation losses (L_da-seg and L_ll-seg) primarily use Cross Entropy Loss with Logits. For the lane line segmentation loss (L_ll-seg), an additional IoU loss is included to specifically handle the sparse nature of lane lines, helping to improve their prediction accuracy.
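A rough sketch of this combined objective is given below, assuming two-channel segmentation logits and hypothetical task weights (w_det, w_da, w_ll, w_iou); the detection term is passed in precomputed because its focal and CIoU components depend on anchor matching that is out of scope here.

```python
import torch
import torch.nn.functional as F


def soft_iou_loss(fg_prob, fg_target, eps=1e-6):
    """Differentiable IoU loss on the foreground (lane-line) probability map.
    fg_prob, fg_target: (N, H, W), target values in {0., 1.}."""
    inter = (fg_prob * fg_target).sum(dim=(1, 2))
    union = (fg_prob + fg_target - fg_prob * fg_target).sum(dim=(1, 2))
    return (1.0 - (inter + eps) / (union + eps)).mean()


def total_loss(det_loss, da_logits, da_target, ll_logits, ll_target,
               w_det=1.0, w_da=1.0, w_ll=1.0, w_iou=1.0):
    """L_all = w_det * L_det + w_da * L_da-seg + w_ll * L_ll-seg.

    det_loss is assumed to already combine the focal (class + objectness)
    and CIoU (box) terms. Segmentation logits are (N, 2, H, W); targets are
    (N, H, W) class indices. Weights are placeholders, not tuned values."""
    l_da = F.cross_entropy(da_logits, da_target)
    ll_fg = torch.softmax(ll_logits, dim=1)[:, 1]            # foreground probability
    l_ll = (F.cross_entropy(ll_logits, ll_target)
            + w_iou * soft_iou_loss(ll_fg, ll_target.float()))
    return w_det * det_loss + w_da * l_da + w_ll * l_ll
```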
Training and Implementation Details
- The network is trained on the BDD100K dataset [bdd100k], which provides diverse driving scenes and annotations for object detection, drivable areas, and lane lines.
- Training uses practical techniques such as k-means clustering to generate prior anchors for detection, the Adam optimizer, learning-rate scheduling (warm-up followed by cosine annealing), and photometric and geometric data augmentations to improve robustness; the warm-up-plus-cosine schedule is sketched after this list.
- While alternating optimization training paradigms were explored, the paper demonstrates that end-to-end joint training yields comparable or better performance with less complexity.
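The following sketch implements a linear warm-up followed by cosine annealing via PyTorch's LambdaLR; the epoch counts, base learning rate, and annealing floor are placeholder values, not the paper's exact hyperparameters.

```python
import math
import torch

# All numbers below are placeholders, not necessarily the paper's settings.
warmup_epochs, total_epochs, base_lr = 3, 240, 1e-3

model = torch.nn.Linear(10, 1)   # stand-in for the YOLOP network
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)


def lr_factor(epoch):
    """Linear warm-up, then cosine annealing down to 20% of the base LR."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.2 + 0.8 * 0.5 * (1.0 + math.cos(math.pi * progress))


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(total_epochs):
    # ... one training epoch over BDD100K (forward, L_all, backward) goes here ...
    optimizer.step()   # placeholder step so the scheduler call below is well ordered
    scheduler.step()
```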
Practical Performance and Results
The key achievement of YOLOP is its ability to perform all three tasks simultaneously in real-time on embedded hardware while achieving competitive accuracy.
- On the BDD100K validation set (with images resized to 640x384), YOLOP achieves state-of-the-art or highly competitive performance on all three tasks compared to existing single-task and multi-task methods.
- Traffic Object Detection: Achieves 89.2% Recall and 76.5% mAP50, comparable to or exceeding methods such as Faster R-CNN [faster-rcnn], MultiNet [multinet], and DLT-Net [dlt-net], and close to YOLOv5s.
- Drivable Area Segmentation: Achieves 91.5% mIoU, significantly outperforming MultiNet, DLT-Net, and PSPNet [pspnet].
- Lane Detection: Achieves 70.50% Pixel Accuracy and 26.20% IoU, substantially better than ENet [enet], SCNN [scnn], and ENet-SAD [sad-enet].
- Crucially, YOLOP achieves an inference speed of 41 FPS on an NVIDIA TITAN XP and 23 FPS on an embedded Jetson TX2. This is highlighted as the first work to achieve real-time performance for this specific combination of tasks on such a device at this accuracy level.
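For readers who want to sanity-check throughput on their own hardware, the snippet below is a minimal latency harness at the paper's 640x384 input resolution. It is not the authors' benchmarking code; real numbers also depend on batch size, numeric precision, and post-processing, which are not included.

```python
import time
import torch


@torch.no_grad()
def measure_fps(model, device=None, iters=200, warmup=20):
    """Average single-image FPS at the 640x384 resolution used in the paper."""
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device).eval()
    img = torch.randn(1, 3, 384, 640, device=device)
    for _ in range(warmup):                      # warm up kernels and caches
        model(img)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(img)
    if device == "cuda":
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)


# Example with the stand-in model from the architecture sketch above:
# print(f"{measure_fps(YOLOPSketch()):.1f} FPS")
```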
Implementation Considerations and Ablations
- The simplicity of the decoder structures and the use of cheap nearest-neighbor interpolation for upsampling contribute significantly to the network's speed.
- Ablation studies validate the multi-task approach:
- Joint multi-task training shows performance very close to training single-task models separately, indicating effective information sharing without significant degradation of individual task performance.
- A comparison between YOLOP (grid-based detection) and R-CNNP, a Faster R-CNN variant extended with the same segmentation heads (region-based detection), shows that YOLOP's grid-based prediction mechanism is more compatible with the segmentation tasks during joint training, yielding better overall multi-task performance than the region-based approach.
Real-World Applications
YOLOP is directly applicable to autonomous driving and Advanced Driver-Assistance Systems (ADAS). By providing simultaneous information about obstacles, drivable paths, and lane boundaries in real-time from a single camera feed, it can serve as a foundational component for perception systems used in navigation, planning, and control modules of autonomous vehicles, especially in scenarios with limited computational resources.
Limitations and Future Work
The authors suggest exploring more sophisticated multi-task learning paradigms to potentially further boost performance, although the current end-to-end training is shown to be quite effective. They also propose extending the framework to incorporate additional driving perception tasks, such as depth estimation, to create a more comprehensive perception system.