
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation (2104.13921v3)

Published 28 Apr 2021 in cs.CV, cs.AI, and cs.LG

Abstract: We aim at advancing open-vocabulary object detection, which detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly to further scale up the number of classes contained in existing object detection datasets. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher model to encode category texts and image regions of object proposals. Then we train a student detector, whose region embeddings of detected boxes are aligned with the text and image embeddings inferred by the teacher. We benchmark on LVIS by holding out all rare categories as novel categories that are not seen during training. ViLD obtains 16.1 mask AP$_r$ with a ResNet-50 backbone, even outperforming the supervised counterpart by 3.8. When trained with a stronger teacher model ALIGN, ViLD achieves 26.3 AP$_r$. The model can directly transfer to other datasets without finetuning, achieving 72.2 AP$_{50}$ on PASCAL VOC, 36.6 AP on COCO and 11.8 AP on Objects365. On COCO, ViLD outperforms the previous state-of-the-art by 4.8 on novel AP and 11.4 on overall AP. Code and demo are open-sourced at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild.

Open-Vocabulary Object Detection via Vision and Language Knowledge Distillation

In recent advancements in computer vision, open-vocabulary object detection has emerged as a pivotal area of research. This paper introduces ViLD, a novel approach addressing the fundamental challenge of limited training data availability for object detection tasks involving arbitrary text inputs.

To circumvent the high costs associated with manually extending detection datasets to include a broader range of object categories, the authors propose a method that leverages existing pretrained image classification models through knowledge distillation. Specifically, ViLD employs a Vision and Language Knowledge Distillation strategy to transfer knowledge from a pretrained model (the teacher) to a two-stage object detector (the student).

Methodology

The methodology can be divided into several key components:

  1. Teacher-Student Framework:
    • The teacher model, pretrained on an extensive image-text dataset, encodes category texts and image regions of object proposals.
    • The student model is a two-stage object detector where region embeddings of detected boxes are aligned with the embeddings inferred by the teacher model.
  2. Training with Text Embeddings (ViLD-text):
    • The pretrained text encoder generates text embeddings for the category labels. The student is trained to classify detected regions by aligning its region embeddings with these text embeddings, leveraging the image-text correlations learned during the teacher's large-scale pretraining.
  3. Distilling Image Embeddings (ViLD-image):
    • Regions proposed by the student are fed into the teacher model to obtain image embeddings. This step involves training the student to mimic these embeddings, enabling a generalized understanding beyond the base categories.
  4. Combination via Weighted Sum:
    • The two objectives are combined via a weighted sum of the ViLD-text and ViLD-image losses, so that region embeddings remain aligned with both the category text embeddings and the teacher's image embeddings (a minimal sketch of these objectives follows this list).
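
The following PyTorch-style sketch illustrates how these pieces could fit together. It is a minimal illustration, not the paper's released code: the tensor shapes, the temperature, the loss weight w, and all function and variable names are assumptions.

```python
import torch
import torch.nn.functional as F

def vild_losses(region_embeds, proposal_embeds, teacher_embeds,
                text_embeds, bg_embed, labels, temperature=0.01, w=0.5):
    """Sketch of the two ViLD objectives (shapes and defaults are illustrative).

    region_embeds:   (N, D) student embeddings for proposals matched to base-category labels.
    proposal_embeds: (M, D) student embeddings for a fixed set of proposals.
    teacher_embeds:  (M, D) teacher image embeddings of the same cropped proposals.
    text_embeds:     (C, D) teacher text embeddings of the base-category names.
    bg_embed:        (D,)   learned background embedding.
    labels:          (N,)   class indices, with 0 reserved for background.
    """
    # ViLD-text: classify each region by cosine similarity to the text embeddings,
    # with a learned background embedding prepended as class 0.
    classifier = F.normalize(torch.cat([bg_embed[None, :], text_embeds], dim=0), dim=-1)
    regions = F.normalize(region_embeds, dim=-1)
    logits = regions @ classifier.t() / temperature      # (N, 1 + C)
    loss_text = F.cross_entropy(logits, labels)

    # ViLD-image: pull the student's proposal embeddings toward the teacher's
    # image embeddings of the same proposals (an L1 distillation loss is assumed here).
    loss_image = F.l1_loss(proposal_embeds, teacher_embeds)

    # Weighted sum of the two objectives.
    return loss_text + w * loss_image
```

Assuming the teacher embeddings are precomputed for a fixed set of proposals, the teacher never has to run inside the training loop, which keeps the distillation step inexpensive.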

Experimental Results

The methodology is evaluated on prominent benchmarks such as LVIS and COCO, with noteworthy results:

  • On LVIS with a ResNet-50 backbone, ViLD achieves a mask AP$_r$ of 16.1, surpassing the supervised counterpart by 3.8 points. With a stronger teacher like ALIGN, the model's performance extends to 26.3 AP$_r$.
  • The method shows impressive transferable capabilities, performing well on datasets like PASCAL VOC, COCO, and Objects365 without additional fine-tuning (a minimal inference sketch follows this list), showcasing the robustness and adaptability of the approach.
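
The transfer behavior follows from how classification works at inference: the classifier is simply rebuilt from the target dataset's category names using the teacher's text encoder, while the detector weights stay fixed. A hedged sketch, in which the prompt template, the text_encoder interface, and the temperature are all assumptions:

```python
import torch
import torch.nn.functional as F

def open_vocab_scores(region_embeds, category_names, text_encoder, bg_embed,
                      temperature=0.01):
    # Embed the target dataset's category names with the frozen text encoder.
    prompts = [f"a photo of a {name}" for name in category_names]
    text_embeds = F.normalize(text_encoder(prompts), dim=-1)       # (C, D)

    # Rebuild the classifier: learned background embedding plus one row per category.
    classifier = torch.cat([F.normalize(bg_embed[None, :], dim=-1), text_embeds], dim=0)

    # Score each detected region against every category by cosine similarity.
    regions = F.normalize(region_embeds, dim=-1)                   # (N, D)
    logits = regions @ classifier.t() / temperature
    return logits.softmax(dim=-1)                                  # (N, 1 + C)
```

Swapping category_names for the PASCAL VOC, COCO, or Objects365 label sets is all the transfer requires.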

Numerical Insights and Implications

The numerical comparisons illustrate that the ViLD variants, and in particular the ViLD-ensemble approach, achieve superior performance on novel categories. The detailed breakdown of performance across different distillation weights and loss functions shows that balancing the textual and visual embedding objectives optimizes detection quality. Furthermore, leveraging stronger teacher networks such as ALIGN yields significant improvements, emphasizing the impact of high-quality pretrained models.
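
As an illustration of what ensembling two such branches can look like (the paper's exact weighting scheme is not reproduced here), a weighted geometric mean of the per-region probabilities from the text-aligned and image-distilled classifiers is one simple option:

```python
import torch

def ensemble_scores(p_text, p_image, lam=0.5, eps=1e-9):
    # p_text, p_image: (N, C) per-region probabilities from the ViLD-text and
    # ViLD-image branches; lam balances the two (an illustrative default).
    return (p_text + eps).pow(1.0 - lam) * (p_image + eps).pow(lam)
```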

Implications and Future Directions

Practically, the methodology provides a scalable solution for object detection tasks that require recognizing a broad and adaptable set of categories. The flexibility to incorporate both image and text embeddings offers a nuanced understanding of visual concepts, which can be particularly useful in applications like autonomous driving, where detecting rare or unseen objects in real time is crucial.

Theoretically, the approach paves the way for further explorations in knowledge distillation, particularly in drawing richer semantic knowledge from extensive multimodal datasets. Future developments might focus on enhancing the teacher-student interaction, exploring more sophisticated transfer learning techniques, and fine-tuning for higher-level attribute recognition and context-aware detection capabilities.

In conclusion, ViLD represents a significant step forward in the field of open-vocabulary object detection, demonstrating how leveraging pretrained models and knowledge distillation can overcome data scarcity issues and advance the state-of-the-art in detecting and recognizing a wide array of object categories.

Authors (4)
  1. Xiuye Gu (17 papers)
  2. Tsung-Yi Lin (49 papers)
  3. Weicheng Kuo (23 papers)
  4. Yin Cui (45 papers)
Citations (792)