YOLO9000: Better, Faster, Stronger (1612.08242v1)

Published 25 Dec 2016 in cs.CV

Abstract: We introduce YOLO9000, a state-of-the-art, real-time object detection system that can detect over 9000 object categories. First we propose various improvements to the YOLO detection method, both novel and drawn from prior work. The improved model, YOLOv2, is state-of-the-art on standard detection tasks like PASCAL VOC and COCO. At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007. At 40 FPS, YOLOv2 gets 78.6 mAP, outperforming state-of-the-art methods like Faster RCNN with ResNet and SSD while still running significantly faster. Finally we propose a method to jointly train on object detection and classification. Using this method we train YOLO9000 simultaneously on the COCO detection dataset and the ImageNet classification dataset. Our joint training allows YOLO9000 to predict detections for object classes that don't have labelled detection data. We validate our approach on the ImageNet detection task. YOLO9000 gets 19.7 mAP on the ImageNet detection validation set despite only having detection data for 44 of the 200 classes. On the 156 classes not in COCO, YOLO9000 gets 16.0 mAP. But YOLO can detect more than just 200 classes; it predicts detections for more than 9000 different object categories. And it still runs in real-time.

YOLO9000: Better, Faster, Stronger

The paper introduces YOLO9000, a real-time object detection system that can detect over 9000 object categories. It builds on YOLO (You Only Look Once) and first presents a set of improvements that yield YOLOv2, which the authors report as state-of-the-art on PASCAL VOC and COCO: 76.8 mAP on VOC 2007 at 67 FPS, and 78.6 mAP at 40 FPS. The paper then proposes a method for training jointly on detection and classification data, and uses it to train YOLO9000 simultaneously on the COCO detection dataset and the ImageNet classification dataset.

Overview of YOLOv2 Improvements

Several key improvements are introduced in YOLOv2, enhancing its performance in terms of accuracy and speed.

Batch Normalization: Adding batch normalization to all convolutional layers improves mean Average Precision (mAP) by more than 2%. It also regularizes the model, so dropout can be removed without overfitting.
High-Resolution Classifier: The classification network is fine-tuned on ImageNet at the full 448 × 448 resolution before it is trained for detection, which gives an increase of almost 4% mAP.
Convolutional with Anchor Boxes: The fully connected layers are removed and the network predicts bounding boxes from anchor boxes instead of predicting coordinates directly. This costs a little accuracy (69.5 to 69.2 mAP) but raises recall from 81% to 88%.
Dimension Clusters: Instead of hand-picking anchor box dimensions, k-means clustering is run on the training set bounding boxes to derive priors, with 1 - IOU as the distance so that large boxes do not dominate the error. The paper settles on k = 5 as a trade-off between model complexity and recall.
Direct Location Prediction: Box centers are predicted relative to the location of the grid cell and passed through a logistic function, which bounds each offset between 0 and 1. Constraining the prediction this way makes training more stable, particularly in early iterations.
Fine-Grained Features: A passthrough layer concatenates higher resolution features from an earlier layer (26 × 26) with the lower resolution features, which gives the detector fine-grained information for localizing smaller objects and a modest 1% performance increase.
Multi-Scale Training: Every 10 batches the network randomly switches to a new input size, chosen from multiples of 32 between 320 and 608. The same network can then run at different resolutions, trading speed for accuracy without requiring multiple models.
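
The Dimension Clusters step can be illustrated with a small sketch. This is not the authors' code: it assumes boxes are given as (width, height) pairs, compares them as if they shared a corner, and seeds the centroids with the first k boxes purely to keep the example reproducible. The names `iou_wh` and `kmeans_iou` and the sample boxes are made up for illustration.

```python
def iou_wh(box, centroid):
    """IOU of two (width, height) boxes aligned at the same corner."""
    w, h = box
    cw, ch = centroid
    inter = min(w, cw) * min(h, ch)
    union = w * h + cw * ch - inter
    return inter / union


def kmeans_iou(boxes, k, iters=100):
    """Cluster (width, height) pairs using d = 1 - IOU as the distance."""
    # Assumption: deterministic seeding with the first k boxes (a simplification).
    centroids = list(boxes[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # The smallest distance 1 - IOU is the largest IOU.
            best = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[best].append(box)
        new_centroids = []
        for i, members in enumerate(clusters):
            if not members:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
                continue
            mean_w = sum(w for w, _ in members) / len(members)
            mean_h = sum(h for _, h in members) / len(members)
            new_centroids.append((mean_w, mean_h))
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    return centroids


# Made-up boxes: three tall ones and three wide ones.
boxes = [(1.0, 2.0), (1.1, 2.2), (0.9, 1.9), (4.0, 1.5), (4.2, 1.4), (3.9, 1.6)]
priors = sorted(kmeans_iou(boxes, k=2))
print(priors)
```

On this toy data the two priors settle on one tall and one wide box shape, which is the behaviour the clustering is meant to capture.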

Together, these improvements bring YOLOv2 to 78.6 mAP on VOC 2007 at 40 FPS, outperforming Faster R-CNN with ResNet and SSD in accuracy while still running significantly faster.
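
To make the Direct Location Prediction item concrete, the sketch below decodes one set of raw network outputs into a box, following the parameterization in the paper: the center is the grid cell offset plus a sigmoid-bounded value, and the width and height scale the prior exponentially. The function name `decode_box` and the argument layout are assumptions for illustration; coordinates are in grid-cell units.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Turn raw outputs (tx, ty, tw, th, to) into a box for one grid cell.

    cell  = (cx, cy): offset of the grid cell from the top-left of the image
    prior = (pw, ph): width and height of the bounding box prior
    """
    tx, ty, tw, th, to = t
    cx, cy = cell
    pw, ph = prior
    bx = sigmoid(tx) + cx        # center x stays inside the cell
    by = sigmoid(ty) + cy        # center y stays inside the cell
    bw = pw * math.exp(tw)       # width scales the prior
    bh = ph * math.exp(th)       # height scales the prior
    objectness = sigmoid(to)     # confidence score in (0, 1)
    return bx, by, bw, bh, objectness


# All-zero outputs give the cell center, the prior's size, and 0.5 objectness.
box = decode_box(t=(0.0, 0.0, 0.0, 0.0, 0.0), cell=(3, 5), prior=(2.0, 4.0))
print(box)  # (3.5, 5.5, 2.0, 4.0, 0.5)
```

Because the sigmoid bounds the center offsets, even extreme values of `tx` or `ty` cannot move the predicted center out of its grid cell.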

YOLO9000's Joint Training Mechanism

YOLO9000 is trained jointly on detection and classification data: labeled detection images teach the model to localize objects precisely, while labeled classification images expand its vocabulary and improve its robustness.

The primary challenge in joint training is reconciling the labels of the two kinds of dataset: detection datasets use general labels such as "dog", while classification datasets like ImageNet use much more specific ones such as "Norfolk terrier". YOLO9000 addresses this with hierarchical classification over WordTree, a tree of visual concepts derived from WordNet. The hierarchy lets labels from ImageNet and COCO be combined coherently in a single model, which greatly extends the range of detectable objects.
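
A minimal sketch of how prediction over such a hierarchy can work, as described in the paper: the model applies a softmax over each group of siblings to get conditional probabilities, and the absolute probability of a node is the product of the conditionals along its path to the root. The miniature tree, the scores, and the function names below are hypothetical, and the root probability is taken to be 1 (in detection, the paper uses the objectness prediction for the root instead).

```python
import math

# Hypothetical miniature tree: node -> parent. The root has parent None.
PARENT = {
    "physical object": None,
    "animal": "physical object",
    "vehicle": "physical object",
    "dog": "animal",
    "cat": "animal",
    "terrier": "dog",
    "hound": "dog",
}


def sibling_groups(parent):
    """Group nodes that share a parent."""
    groups = {}
    for node, par in parent.items():
        if par is not None:
            groups.setdefault(par, []).append(node)
    return groups


def conditional_probs(scores, parent):
    """Softmax within each sibling group, giving Pr(node | parent)."""
    cond = {}
    for children in sibling_groups(parent).values():
        top = max(scores[c] for c in children)  # subtract max for stability
        exps = {c: math.exp(scores[c] - top) for c in children}
        total = sum(exps.values())
        for c in children:
            cond[c] = exps[c] / total
    return cond


def absolute_prob(node, cond, parent):
    """Multiply conditionals along the path from node up to the root."""
    p = 1.0  # assumption: Pr(root) = 1
    while parent[node] is not None:
        p *= cond[node]
        node = parent[node]
    return p


# Made-up raw scores for every non-root node.
scores = {"animal": 2.0, "vehicle": 0.0, "dog": 1.0, "cat": 1.0,
          "terrier": 0.0, "hound": 0.0}
cond = conditional_probs(scores, PARENT)
p_terrier = absolute_prob("terrier", cond, PARENT)
print(round(p_terrier, 4))
```

Here Pr(terrier) = Pr(terrier | dog) · Pr(dog | animal) · Pr(animal | physical object). A detection image labeled only "dog" can supervise the "dog" node and its ancestors without requiring a choice between "terrier" and "hound".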

Practical and Theoretical Implications

The practical implications of YOLO9000 are most relevant to applications that require real-time object detection, such as autonomous driving, robotics, and surveillance systems. The ability to predict detections for over 9000 categories in real time makes it applicable across many domains, although accuracy on classes without detection data remains modest: 16.0 mAP on the 156 ImageNet detection classes that are not in COCO.

Theoretically, the hierarchical classification approach using WordTree provides a framework for integrating diverse datasets, mitigating the dataset size gap between detection and classification tasks. This technique opens new pathways for leveraging large-scale datasets effectively, potentially influencing future research in weakly supervised learning and multi-task learning.

Future Directions

Future research can explore WordTree-style hierarchies in other domains, such as weakly supervised image segmentation, and refine the joint training methodology with better strategies for assigning labels to classification data. Techniques like multi-scale training may also benefit visual tasks beyond object detection.

In conclusion, YOLO9000 advances real-time object detection by greatly expanding the number of categories a detector can recognize. Its hierarchical classification and joint training methods are likely to influence future work in computer vision.


Authors (2)

Joseph Redmon (8 papers)
Ali Farhadi (138 papers)

Citations (14,698)
