Overview of Microsoft COCO: Common Objects in Context
The paper, "Microsoft COCO: Common Objects in Context," authored by Tsung-Yi Lin et al., introduces the Microsoft COCO (MS COCO) dataset, designed to advance the field of object recognition by situating it within the broader scope of scene understanding. The dataset comprises 328,000 images, annotated with 2.5 million object instances spanning 91 object categories. These per-instance segmentations facilitate precise object localization, thus helping to address key challenges in scene understanding. The dataset's construction heavily involved crowd work through Amazon Mechanical Turk (AMT), employing novel user interfaces to ensure comprehensive category detection, instance spotting, and instance segmentation.
Key Contributions and Comparative Analysis
The primary contributions of the MS COCO dataset are:
- Focus on Non-Iconic Views: The dataset emphasizes non-iconic perspectives of objects, which are more representative of real-world scenes where objects might be occluded or placed among clutter.
- Contextual Relationships: Unlike previous datasets that contain objects in isolated or iconic views, MS COCO captures scenes rich in contextual information, facilitating research on contextual reasoning between objects.
- Detailed Spatial Localization: The dataset provides precise per-instance segmentation masks, enabling accurate evaluation of object localization methods (see the mask sketch after this list).
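As a concrete illustration of the third point, the sketch below rasterizes one instance's segmentation into a binary mask using pycocotools; the file path and split are assumptions.

```python
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2014.json")  # assumed local path

# Take the first annotation of the first image and rasterize its segmentation.
ann = coco.loadAnns(coco.getAnnIds(imgIds=coco.getImgIds()[0]))[0]
mask = coco.annToMask(ann)  # H x W array, 1 inside the instance, 0 elsewhere

print("instance covers", int(mask.sum()), "of", mask.size, "pixels")
```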
The paper supports these design choices with a detailed statistical comparison against other prominent datasets, including PASCAL VOC, ImageNet, and SUN. MS COCO stands out for its higher number of instances per category and per image, which supports learning more nuanced object models: it averages 7.7 instances per image, compared with 3.0 for ImageNet and 2.3 for PASCAL VOC.
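Such statistics are easy to recompute from an annotation file. The short sketch below tallies instances per image using only the standard library; the file path is an assumption, and images without annotations count toward the denominator.

```python
import json
from collections import Counter

with open("annotations/instances_train2014.json") as f:  # assumed local path
    data = json.load(f)

# Count annotations per image, then average over all images in the split.
per_image = Counter(ann["image_id"] for ann in data["annotations"])
avg = sum(per_image.values()) / len(data["images"])
print(f"average instances per image: {avg:.1f}")
```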
Dataset Creation and Annotation Pipeline
The construction of the MS COCO dataset involved multiple stages:
- Image Collection: Images were gathered from Flickr using pairwise object-object and object-scene queries (illustrated in the sketch after this list) to favor non-iconic views and contextually rich scenes.
- Category Labeling: A hierarchical approach was employed, where workers identified super-categories first, followed by specific categories within these groups.
- Instance Spotting and Segmentation: Workers marked each instance of the labeled categories and then segmented the instances individually, with verification stages in the pipeline to ensure high-quality annotations.
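The pairwise query idea from the first stage can be sketched in a few lines; the object and scene word lists below are hypothetical examples, not the paper's actual query set.

```python
import itertools

# Hypothetical object and scene terms; the real collection paired the object
# categories with each other and with scene terms to surface non-iconic images.
objects = ["dog", "bicycle", "cup"]
scenes = ["kitchen", "street", "beach"]

queries = [f"{o} {s}" for o, s in itertools.product(objects, scenes)]
print(queries[:4])  # ['dog kitchen', 'dog street', 'dog beach', 'bicycle kitchen']
```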
Evaluation and Performance Analysis
The paper presents baseline analyses using Deformable Parts Models (DPMs) trained on PASCAL VOC and on MS COCO. Models trained on MS COCO exhibit improved generalization to PASCAL VOC, indicating the richness of the new data. However, detection performance on MS COCO itself is substantially lower, reflecting the difficulty of detecting objects in non-iconic, cluttered scenes.
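Detection on both datasets is scored with the intersection-over-union (IoU) overlap criterion, under which a predicted box typically counts as correct when its IoU with a ground-truth box exceeds 0.5. A minimal self-contained sketch of that criterion follows.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two partially overlapping boxes: IoU ~= 0.14, below the usual 0.5 threshold.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
```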
Implications and Future Directions
The MS COCO dataset has practical implications for advancing object detection, segmentation, and contextual reasoning in computer vision algorithms. Its design favors the development of models capable of handling real-world scenarios with high object occlusion and abundant contextual information. The dataset's comprehensive nature and rich annotations make it a valuable resource for training and evaluating sophisticated computer vision models.
In terms of theoretical implications, the dataset underscores the importance of non-iconic perspectives and contextual relationships in scene understanding. Moving forward, augmenting MS COCO with additional annotations, such as object attributes or scene types, could further deepen its utility. Additionally, incorporating "stuff" categories alongside "things" might provide further insights into the interactions between different components of a scene.
Conclusion
The Microsoft COCO dataset represents a significant step forward for scene understanding and object recognition research. By emphasizing contextual relationships and non-iconic views, it presents a more realistic challenge for computer vision models. Consequently, it fosters the development of algorithms that are not only robust in detecting and segmenting objects but also proficient in incorporating contextual reasoning. The extensive and detailed annotations set a high bar for future datasets, pushing the boundaries of what can be achieved in scene understanding.