Zero-Shot Object Detection: A Comprehensive Analysis
The paper authored by Ankan Bansal et al. addresses the novel and challenging problem of Zero-Shot Object Detection (ZSD). ZSD is an extension of Zero-Shot Learning (ZSL), where the objective is to detect and classify object instances from categories that were not seen during the training phase. This stands in contrast to the traditional object detection paradigm, which relies extensively on large annotated datasets for both training and validation.
The authors highlight the limitations of existing object detection frameworks in scaling to large numbers of object categories, referencing the prohibitive costs associated with acquiring extensive bounding box annotations. In this context, the paper represents an ambitious attempt to bridge the gap between zero-shot vision and structured object detection tasks.
Methodological Contributions
- Visual-Semantic Embeddings: A central pillar of the proposed method is the adaptation of visual-semantic embeddings, which have previously been leveraged for zero-shot classification but not extensively explored for detection. These embeddings map both image features and semantic class labels into a shared representation space, enabling the prediction of unseen classes.
- Background Integration in ZSD: The authors address the frequently overlooked challenge of discriminating objects from background in a ZSD framework, proposing two background modeling techniques:
- Statically Assigned Background (SB): A rudimentary approach where a fixed vector represents a monolithic background class in the embedding space.
- Latent Assignment Based (LAB): An iterative algorithm that utilizes Expectation-Maximization-like procedures to allocate multiple latent background labels, distributing them over a larger open vocabulary.
- Dense Sampling of Semantic Space (DSES): To alleviate the sparsity in the semantic embedding space, the authors introduce auxiliary data from external sources, enriching the diversity of visual concepts and improving alignment between visual and semantic domains.
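The embedding-based scoring underlying these components can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual implementation: the dimensions, class names, random projection, and fixed background vector are all assumptions. A region's visual feature is projected into the word-embedding space and scored by cosine similarity against seen and unseen class embeddings, with an SB-style fixed background vector included as one extra "class":

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 2048-d visual features, 300-d word embeddings.
VIS_DIM, SEM_DIM = 2048, 300

# Linear projection from visual space into the shared semantic space
# (learned in the real model; random here purely for illustration).
W = rng.normal(scale=0.01, size=(SEM_DIM, VIS_DIM))

# Word embeddings for seen + unseen classes (in practice from word2vec/GloVe);
# "zebra" stands in for an unseen category here.
class_names = ["dog", "car", "zebra"]
class_vecs = rng.normal(size=(len(class_names), SEM_DIM))

# Statically Assigned Background (SB): one fixed vector models background.
background_vec = rng.normal(size=SEM_DIM)
all_vecs = np.vstack([class_vecs, background_vec])
all_names = class_names + ["background"]

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def score_region(visual_feature):
    """Project a region feature into semantic space; score every class."""
    proj = l2_normalize(W @ visual_feature)
    sims = l2_normalize(all_vecs) @ proj  # cosine similarities
    return dict(zip(all_names, sims))

region = rng.normal(size=VIS_DIM)
scores = score_region(region)
predicted = max(scores, key=scores.get)
```

Under this view, LAB would replace the single `background_vec` with multiple latent background embeddings drawn from an open vocabulary and reassigned iteratively, while DSES would enlarge `class_vecs` with additional external concepts to densify the semantic space.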
Evaluation and Empirical Analysis
The empirical evaluation was conducted on novel splits of two widely used datasets: MSCOCO and Visual Genome. The proposed models outperform baselines in various settings; the DSES technique is particularly effective on MSCOCO thanks to enhanced label diversity, while LAB achieves superior results on Visual Genome, attributed to its capacity to handle a wider spectrum of visual backgrounds.
Recall at top-K detections and mean Average Precision (mAP) serve as evaluation metrics. The paper reports encouraging results across different Intersection over Union (IoU) thresholds and emphasizes that background-aware approaches, LAB in particular, yield superior detection quality, especially in cluttered scenes.
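As a concrete reference for these metrics, the sketch below (a simplified illustration, not the paper's evaluation code) computes IoU between two boxes and Recall@K: the fraction of ground-truth boxes matched by one of the top-K scored detections at a given IoU threshold:

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(detections, ground_truths, k=100, iou_thresh=0.5):
    """Fraction of ground-truth boxes covered by the top-k detections.

    detections: list of (score, box); ground_truths: list of boxes.
    Each detection may match at most one ground-truth box.
    """
    top_k = sorted(detections, key=lambda d: d[0], reverse=True)[:k]
    matched, used = 0, set()
    for gt in ground_truths:
        for i, (_, box) in enumerate(top_k):
            if i not in used and iou(box, gt) >= iou_thresh:
                matched += 1
                used.add(i)
                break
    return matched / len(ground_truths) if ground_truths else 0.0
```

Raising `iou_thresh` demands tighter localization, which is exactly where the paper's background-aware variants are reported to help.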
Implications and Future Directions
The paper opens several promising directions for future research. Leveraging semantically cohesive hierarchical information and integrating lexical ontology are proposed as potential avenues to improve ZSD. Additionally, exploring bounding box regression and hard-negative mining in the absence of supervision remains an open challenge that warrants further investigation.
In conclusion, the paper by Bansal et al. takes significant methodological strides toward making Zero-Shot Object Detection practical. Their exploration of semantic embeddings and background modeling lays a foundation for real-world detection systems that scale and generalize across a diverse range of object categories unseen at training time.