Detect Everything with Few Examples (2309.12969v4)

Published 22 Sep 2023 in cs.CV

Abstract: Few-shot object detection aims at detecting novel categories given only a few example images. It is a basic skill for a robot to perform tasks in open environments. Recent methods focus on finetuning strategies, with complicated procedures that prohibit a wider application. In this paper, we introduce DE-ViT, a few-shot object detector without the need for finetuning. DE-ViT's novel architecture is based on a new region-propagation mechanism for localization. The propagated region masks are transformed into bounding boxes through a learnable spatial integral layer. Instead of training prototype classifiers, we propose to use prototypes to project ViT features into a subspace that is robust to overfitting on base classes. We evaluate DE-ViT on few-shot, and one-shot object detection benchmarks with Pascal VOC, COCO, and LVIS. DE-ViT establishes new state-of-the-art results on all benchmarks. Notably, for COCO, DE-ViT surpasses the few-shot SoTA by 15 mAP on 10-shot and 7.2 mAP on 30-shot and one-shot SoTA by 2.8 AP50. For LVIS, DE-ViT outperforms few-shot SoTA by 17 box APr. Further, we evaluate DE-ViT with a real robot by building a pick-and-place system for sorting novel objects based on example images. The videos of our robot demonstrations, the source code and the models of DE-ViT can be found at https://mlzxy.github.io/devit.

Summary

  • The paper presents DE-ViT, a novel detection method that eliminates finetuning by leveraging a region-propagation network to accurately localize objects.
  • It employs feature subspace projection to combat overfitting, balancing high base class accuracy with improved detection of novel classes.
  • Empirical results show DE-ViT outperforms prior state-of-the-art methods, with gains of 15 mAP on 10-shot COCO and 17 box APr on LVIS.

Detect Everything with Few Examples: An Examination

The paper presents an innovative approach to few-shot object detection with DE-ViT, a detection method that eliminates the need for finetuning, a common burden in contemporary approaches. Few-shot object detection aims to identify novel object categories from only a limited number of example images. The authors address the limitations of existing methods, which depend heavily on complex finetuning strategies and often exhibit a significant performance gap between base and novel classes.

Architectural Innovations

Central to the approach is the DE-ViT architecture, which is underpinned by a region-propagation mechanism. This mechanism localizes objects by propagating region masks, which are then converted into bounding boxes through a learnable spatial integral layer. Conventional bounding box regression on pretrained Vision Transformer (ViT) features struggles because those features carry no explicit coordinate information. DE-ViT circumvents this by predicting object masks with a region-propagation network and mapping them to bounding box coordinates with the spatial integral layer, improving localization accuracy without modifying the ViT backbone.
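The mask-to-box idea can be illustrated with a soft-argmax-style reduction: treat the predicted mask as a spatial distribution and read off expected coordinates and spreads. This is a minimal sketch of the concept, not the paper's learnable layer; the scale factor `k` and the use of coordinate means and standard deviations are illustrative assumptions.

```python
import numpy as np

def spatial_integral(mask, k=2.0):
    """Map a soft region mask (H, W) to an (x0, y0, x1, y1) box.

    Sketch of a spatial-integral reduction: normalize the mask into a
    distribution, take expected x/y coordinates as the box center, and
    use the coordinate spread (scaled by a hypothetical factor k) as the
    half-extent. The paper's layer is learnable; this is not.
    """
    h, w = mask.shape
    p = mask / (mask.sum() + 1e-8)            # normalize to a distribution
    ys = p.sum(axis=1)                        # marginal over rows
    xs = p.sum(axis=0)                        # marginal over columns
    cy = float((np.arange(h) * ys).sum())     # expected center y
    cx = float((np.arange(w) * xs).sum())     # expected center x
    sy = float(np.sqrt(((np.arange(h) - cy) ** 2 * ys).sum()))
    sx = float(np.sqrt(((np.arange(w) - cx) ** 2 * xs).sum()))
    return (cx - k * sx, cy - k * sy, cx + k * sx, cy + k * sy)
```

Because the reduction is differentiable, gradients can flow from a box-level loss back into the mask predictor, which is what makes an integral-style layer attractive here.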

Feature Subspace Projection

The feature subspace projection is introduced to tackle the overfitting on base classes, a prevalent issue in few-shot settings. Instead of using prototypes as classifier weights, DE-ViT projects ViT features into a subspace through these prototypes, thus creating a feature space more resistant to overfitting. By doing so, this approach maintains a balance between achieving high accuracy on base classes while minimizing performance degradation on novel classes.
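One way to read "projecting features into a subspace through prototypes" is an orthogonal projection onto the span of the class prototypes. The sketch below is an illustrative interpretation, not the paper's exact formulation; the function name and the QR-based orthonormalization step are assumptions.

```python
import numpy as np

def project_to_prototype_subspace(feats, prototypes):
    """Project ViT features onto the subspace spanned by class prototypes,
    rather than using the prototypes directly as classifier weights.

    feats:      (N, D) array of ViT features
    prototypes: (C, D) array of class prototypes, C < D
    returns:    (N, D) features with components outside the span removed
    """
    # Orthonormal basis for the prototype span via reduced QR decomposition
    q, _ = np.linalg.qr(prototypes.T)   # (D, C), columns orthonormal
    return feats @ q @ q.T              # orthogonal projection onto the span
```

Intuitively, directions orthogonal to every prototype are discarded, so a downstream head cannot latch onto base-class-specific feature directions, which is consistent with the overfitting-resistance argument above.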

Empirical Results

The empirical evaluation demonstrates that DE-ViT achieves state-of-the-art results across several benchmarks, including Pascal VOC, COCO, and LVIS. On COCO in particular, DE-ViT surpasses the few-shot state of the art by 15 mAP in the 10-shot setting and by 7.2 mAP in the 30-shot setting, and it outperforms the one-shot state of the art by 2.8 AP50. On the challenging LVIS dataset, DE-ViT exceeds the previous state of the art by 17 box APr, a robust improvement in detecting novel objects.

Implications and Future Directions

DE-ViT's architecture is significant for both practical applications and theoretical developments in AI. Practically, by eliminating the need for finetuning, it facilitates broader and more accessible application of few-shot detection models. Theoretically, the combination of region propagation and feature subspace projection provides a foundation for further exploration in unsupervised and self-supervised learning contexts. This alignment may influence future research directions in developing models with enhanced generalization capabilities across diverse object categories, especially in resource-constrained environments where labeled data is scarce.

While DE-ViT addresses several limitations of current methods, future research could explore the potential integration of segmentation outputs and class-level attention mechanisms to mitigate computational overhead. These advancements could lead to more streamlined models that retain or even enhance detection efficacy, offering promising trajectories for developing more adaptive and scalable detection systems in computer vision.

In conclusion, DE-ViT represents a significant stride in few-shot object detection by blending architectural innovation with empirical excellence. As AI continues to evolve, methods like DE-ViT not only refine current capabilities but also pave the way for new avenues in adaptable and efficient machine learning applications.
