Analyzing Object Detection via Multi-Region Content and Semantic Segmentation-Aware CNN Models
Spyros Gidaris and Nikos Komodakis propose an object detection system built on a Multi-Region Convolutional Neural Network (CNN) architecture that also incorporates semantic segmentation-aware features. The paper's primary contributions are a richer object representation and a more accurate localization scheme for object detection.
Object Representation
The core architecture, referred to as Multi-Region CNN, integrates multiple region adaptation modules, each focusing on a different aspect of an object's appearance: the object as a whole, its parts, its borders, and its surrounding context. Together these modules aim to create a richer and more discriminative object representation.
Region Components and Their Roles
- Original Candidate Box: Captures the entire object appearance and serves as the baseline representation.
- Half Boxes: The left, right, top, and bottom halves of the box, capturing partial appearances and border-specific characteristics.
- Central Regions: Scaled-down versions of the candidate box that focus on the core parts of the object.
- Border Regions: Regions around the object's borders, designed to make the representation sensitive to localization quality.
- Contextual Regions: Enlarged regions around the object that add scene context, improving recognition in cluttered scenes.
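The region set above can be sketched as simple geometric transforms of one candidate box. The helper below is illustrative only; the specific scale factors (0.5 for the central region, 1.3 for the border region's outer box, 1.8 for context) are assumptions for this sketch, not the exact values used in the paper.

```python
def scale_box(box, factor):
    """Scale an (x1, y1, x2, y2) box about its centre by `factor`."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    hw, hh = (x2 - x1) / 2.0 * factor, (y2 - y1) / 2.0 * factor
    return (cx - hw, cy - hh, cx + hw, cy + hh)

def multi_region_set(box):
    """Return the named sub-regions derived from one candidate box.
    Scale factors are illustrative assumptions, not the paper's exact values."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return {
        "original":    box,
        "left_half":   (x1, y1, cx, y2),
        "right_half":  (cx, y1, x2, y2),
        "top_half":    (x1, y1, x2, cy),
        "bottom_half": (x1, cy, x2, y2),
        "central":     scale_box(box, 0.5),  # scaled-down core of the object
        "border":      scale_box(box, 1.3),  # outer box of a border ring region
        "context":     scale_box(box, 1.8),  # enlarged surrounding context
    }
```

Each region is then warped to a fixed size and fed to its own adaptation module, so the network sees several complementary views of the same proposal.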
Integration of Semantic Segmentation-Aware Features
The architecture extends to include features informed by semantic segmentation. A Fully Convolutional Network (FCN), trained in a weakly supervised manner using only bounding box annotations, produces segmentation-aware activation maps. Because no pixel-level segmentation annotations are required, the approach remains data efficient.
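The weak supervision described above amounts to deriving a coarse foreground/background target from box annotations alone. A minimal sketch of that idea, assuming pixels inside any ground-truth box are treated as foreground:

```python
import numpy as np

def weak_seg_target(image_shape, boxes):
    """Build a foreground/background training target from bounding boxes alone.
    Pixels inside any ground-truth box are marked foreground (1); all others
    background (0). This is the kind of weak label an FCN can be trained on
    without pixel-level masks (illustrative sketch, not the paper's exact recipe)."""
    h, w = image_shape
    target = np.zeros((h, w), dtype=np.uint8)
    for x1, y1, x2, y2 in boxes:
        target[int(y1):int(y2), int(x1):int(x2)] = 1
    return target
```

The FCN trained on such targets produces activation maps that respond to object-like regions, and those maps are pooled as extra features for the detector.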
Object Localization via Iterative Refinement
The proposed system enhances localization accuracy using an iterative scheme that alternates between scoring and refining candidate bounding boxes. A CNN-based bounding box regression module repeatedly adjusts the proposals, and after non-maximum suppression a box-voting scheme merges overlapping candidates into the final detections.
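The box-voting step can be sketched as a score-weighted average over the candidates that overlap each NMS survivor. The 0.5 IoU threshold below is an assumption for illustration:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def box_voting(kept, all_boxes, all_scores, thresh=0.5):
    """Refine each NMS survivor by a score-weighted average of the candidate
    boxes that overlap it by at least `thresh` (threshold is an assumption)."""
    refined = []
    for k in kept:
        neighbours = [(b, s) for b, s in zip(all_boxes, all_scores)
                      if iou(k, b) >= thresh]
        total = sum(s for _, s in neighbours)
        refined.append(tuple(
            sum(b[i] * s for b, s in neighbours) / total for i in range(4)))
    return refined
```

Because suppressed-but-overlapping candidates still carry localization evidence, averaging them typically nudges the surviving box closer to the object than any single proposal.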
Experimental Evaluation and Results
The efficacy of this method is demonstrated on the PASCAL VOC2007 and VOC2012 datasets. Notably:
- The Multi-Region CNN model achieved a mean Average Precision (mAP) of 66.2% on VOC2007, outperforming the baseline R-CNN with VGG-Net.
- Incorporating semantic segmentation-aware features increased mAP to 67.5%.
- With the complete iterative localization scheme, the model achieved 74.9% mAP on VOC2007, a new state of the art for that dataset and training configuration.
Detection Error Analysis
An error analysis using false-positive breakdowns demonstrated substantial reductions in localization errors with the Multi-Region CNN model. This improvement underscores the model's heightened sensitivity to accurate localization.
Correlation with Bounding Box Overlap
Further experiments showed a stronger correlation between the model's predicted scores and the overlap of candidate boxes with ground truth. An AUC analysis likewise indicated that the Multi-Region CNN model is more adept at discriminating well-localized from mis-localized proposals.
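The discrimination measure above can be sketched as a rank-based AUC: the probability that a well-localized proposal (IoU with ground truth at or above some threshold) outscores a mis-localized one. The 0.5 threshold here is an illustrative assumption:

```python
def auc_well_vs_mislocalized(scores, ious, iou_thresh=0.5):
    """Probability that a well-localized proposal (IoU >= iou_thresh)
    outscores a mis-localized one; ties count half. Equivalent to the
    area under the ROC curve (illustrative sketch)."""
    pos = [s for s, i in zip(scores, ious) if i >= iou_thresh]
    neg = [s for s, i in zip(scores, ious) if i < iou_thresh]
    if not pos or not neg:
        return None  # AUC undefined with only one class present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC near 1.0 means the detector's scores reliably rank well-localized boxes above poorly localized ones, which is exactly the property the multi-region representation is designed to strengthen.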
Performance on VOC2012 and Use of Extra Data
On the VOC2012 test set, the model trained solely on VOC2007 data achieved a mAP of 69.1%, and training with VOC2012 data further increased this to 70.7%. Using additional data from both VOC2007 and VOC2012, the method achieved a mAP of 78.2% on VOC2007, outperforming contemporary methods including Faster R-CNN and NoC.
Future Directions
This paper’s methodology indicates promising future research avenues, including exploring more sophisticated region proposals and integrating additional contextual information. These approaches could further boost performance and address challenging scenarios like detecting multiple adjacent objects more accurately.
In conclusion, this research advances object detection by combining diverse region-based representations with segmentation-aware features and an iterative localization scheme. The findings are relevant both to further research on object representation and to practical computer vision applications.