You Only Look Once: Unified, Real-Time Object Detection
The paper "You Only Look Once: Unified, Real-Time Object Detection" presented by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi introduces an innovative approach known as YOLO. This methodology reconceptualizes object detection as a single regression problem, departing from traditional classifier-based frameworks and enabling end-to-end optimization directly on detection performance.
Introduction and Methodology
YOLO unifies the object detection pipeline into a single neural network. Unlike previous systems that repurpose classifiers and rely on intricate pipelines involving region proposals (e.g., the R-CNN variants), YOLO processes the full image and outputs bounding boxes and class probabilities in a single evaluation. This simplifies the detection framework and makes inference substantially faster.
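As a rough sketch of this unified formulation, the network's output is a single fixed-size tensor. The values below follow the paper's PASCAL VOC configuration; the snippet itself is illustrative, not the authors' code.

```python
# YOLO frames detection as regression to one fixed-size tensor.
# For PASCAL VOC the paper uses S = 7, B = 2, C = 20: each of the
# S*S grid cells predicts B boxes (x, y, w, h, confidence) plus
# C conditional class probabilities.
S, B, C = 7, 2, 20
output_size = S * S * (B * 5 + C)
print(output_size)  # 7 * 7 * 30 = 1470 values per image
```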
The YOLO model divides the input image into an S×S grid (S = 7 in the paper's PASCAL VOC configuration). Each grid cell predicts B bounding boxes (B = 2), a confidence score for each box, and a set of conditional class probabilities. The confidence score is the product of the probability that the box contains an object and the Intersection over Union (IOU) between the predicted and ground-truth boxes. By constraining each cell to a small, fixed number of bounding boxes, YOLO limits duplicate detections, and because the network reasons over the entire image at prediction time, it localizes objects using full-image context.
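The following is a minimal sketch of that confidence computation; the function and box values are hypothetical, used only to make the formula concrete.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Hypothetical predicted box and its matched ground-truth box.
pred_box = (30.0, 40.0, 120.0, 160.0)
truth_box = (35.0, 45.0, 125.0, 165.0)

# Confidence target = Pr(Object) * IOU. Pr(Object) is 1 when an object's
# center falls in the cell, so the target reduces to the IOU itself.
confidence_target = 1.0 * iou(pred_box, truth_box)
print(round(confidence_target, 3))
```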
Performance and Results
The YOLO model is exceptionally fast, processing images at 45 frames per second (fps) in its base configuration and up to 155 fps in its smaller, faster variant, Fast YOLO. These figures are measured on a Titan X GPU, making YOLO suitable for real-time applications. YOLO also achieves more than twice the mean average precision (mAP) of other real-time detectors.
Trained on natural images from the PASCAL VOC dataset, YOLO also generalizes well, outstripping methods like DPM and R-CNN when applied to domains beyond natural images, such as artwork. This suggests that YOLO learns highly general object representations that hold up even on unfamiliar domains.
One tradeoff observed with YOLO is a higher rate of localization errors compared to methods like Fast R-CNN, which localize more accurately but are computationally slower. However, because YOLO reasons globally over the entire image during prediction, it produces substantially fewer background false positives, improving the robustness of its detections.
Comparison to Existing Methods
YOLO's approach diverges significantly from traditional object detection pipelines. Systems built on sliding windows or region proposals (e.g., DPM, R-CNN) suffer from architectural complexity and slow inference. R-CNN, for instance, runs in multiple stages, including Selective Search proposal generation and SVM-based classification, making it sluggish (on the order of 40 seconds per image). In contrast, YOLO's unified architecture discards these intermediary steps in favor of a single model optimized end to end for both speed and detection performance.
While Faster R-CNN improves on R-CNN by replacing Selective Search with a region proposal network, it still lags behind YOLO in real-time performance, even though it achieves higher mAP on some benchmarks. YOLO, which evaluates far fewer boxes per image (98, versus roughly 2,000 Selective Search proposals) under its grid's spatial constraints, balances speed and accuracy adeptly.
Practical Implications and Future Directions
The real-time capabilities of YOLO make it particularly appealing for applications that demand immediate object recognition, such as autonomous driving, surveillance, and interactive systems. Its speed and ease of integration are a decisive advantage over complex, multi-stage detectors. Additionally, YOLO's capacity for generalization suggests it can be deployed across diverse image domains without extensive retraining.
Future work could address YOLO's localization deficiencies, especially for small objects. Refining the loss function to better handle errors in small bounding boxes, improving grid cell predictions, and experimenting with deeper network architectures could all drive further gains.
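The paper's loss already takes one step in this direction by regressing the square root of box width and height, so that a fixed absolute error is penalized more heavily on small boxes than on large ones. Here is a minimal sketch of that term; the function name and box sizes are illustrative.

```python
import math

def size_loss(pred_wh, true_wh):
    """Sum-squared error on sqrt(w) and sqrt(h), as in YOLO's loss term."""
    return sum((math.sqrt(p) - math.sqrt(t)) ** 2
               for p, t in zip(pred_wh, true_wh))

# The same 10-pixel error costs more on a small box than a large one.
print(size_loss((20, 20), (30, 30)))      # small box: ~2.02
print(size_loss((200, 200), (210, 210)))  # large box: ~0.24
```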
Conclusion
By reframing object detection as a regression task, the YOLO framework sets a new standard for real-time object detection, providing a robust, fast alternative to traditional detection systems. Its unified architecture marks a significant advance in detection efficiency and generalization, positioning YOLO as a pivotal model in computer vision research and applied AI.