
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection (2303.05499v5)

Published 9 Mar 2023 in cs.CV

Abstract: In this paper, we present an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves a $52.5$ AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean $26.1$ AP. Code will be available at \url{https://github.com/IDEA-Research/GroundingDINO}.

The paper introduces Grounding DINO, an open-set object detector that combines the Transformer-based detector DINO with grounded pre-training to detect arbitrary objects from human inputs such as category names or referring expressions. The approach hinges on integrating language into a closed-set detector so that it generalizes to open-set concepts.

To fuse the language and vision modalities effectively, the paper conceptually divides a closed-set detector into three phases and proposes a tight fusion solution comprising:

  • a feature enhancer
  • language-guided query selection
  • a cross-modality decoder

The paper augments open-set object detection evaluation to include referring expression comprehension for objects specified with attributes, in addition to evaluation on novel categories. The model achieves 52.5 average precision (AP) on the COCO detection zero-shot transfer benchmark without any training data from COCO, and reaches 63.0 AP after fine-tuning on COCO. It also sets a new state of the art on the Object Detection in the Wild (ODinW) zero-shot benchmark with a mean AP of 26.1.

Related Work

The paper positions Grounding DINO within the context of existing detection Transformers and open-set object detection methodologies. It builds upon DINO, which itself is an end-to-end Transformer-based detector, and contrasts it with other open-set detection methods like Vision and Language knowledge Distillation (ViLD) and Grounded Language-Image Pre-training (GLIP). The paper asserts that previous works have not fully exploited multi-modal information fusion across all phases, potentially leading to suboptimal language generalization.

Method

Grounding DINO processes an (Image, Text) pair and outputs object boxes with their corresponding noun phrases, so a single model serves both object detection and Referring Expression Comprehension (REC). The architecture is a dual-encoder-single-decoder design comprising the following components, whose data flow is sketched after the list:

  • an image backbone for image feature extraction
  • a text backbone for text feature extraction
  • a feature enhancer for image and text feature fusion
  • a language-guided query selection module for query initialization
  • a cross-modality decoder for box refinement
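
The following is a minimal, hypothetical sketch of how these components connect at the tensor level, using simple PyTorch stand-ins (random features, standard attention, and a single generic decoder layer) in place of the real Swin/BERT backbones, deformable attention, and multi-layer enhancer/decoder; all dimensions, module choices, and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, num_queries = 256, 900               # 900 decoder queries, as in the paper
img_feat = torch.randn(1, 1200, d_model)      # stand-in for image backbone output
txt_feat = torch.randn(1, 32, d_model)        # stand-in for text backbone output

# 1) Feature enhancer: fuse the two token sequences with cross-attention (stand-in).
img2txt = nn.MultiheadAttention(d_model, 8, batch_first=True)
txt2img = nn.MultiheadAttention(d_model, 8, batch_first=True)
img_feat = img_feat + img2txt(img_feat, txt_feat, txt_feat)[0]
txt_feat = txt_feat + txt2img(txt_feat, img_feat, img_feat)[0]

# 2) Language-guided query selection: keep the image tokens most similar to the text.
sim = (img_feat @ txt_feat.transpose(1, 2)).max(dim=-1).values        # (1, 1200)
idx = sim.topk(num_queries, dim=1).indices                            # (1, 900)
queries = img_feat.gather(1, idx.unsqueeze(-1).expand(-1, -1, d_model))

# 3) Cross-modality decoder: refine queries against image and text tokens (stand-in).
decoder_layer = nn.TransformerDecoderLayer(d_model, 8, batch_first=True)
queries = decoder_layer(queries, torch.cat([img_feat, txt_feat], dim=1))

# 4) Heads: box regression and object-to-token alignment scores for classification.
boxes = nn.Linear(d_model, 4)(queries).sigmoid()        # (1, 900, 4)
logits = queries @ txt_feat.transpose(1, 2)             # (1, 900, 32) alignment scores
print(boxes.shape, logits.shape)
```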

The image and text backbones extract multi-scale image features (e.g., using a Swin Transformer) and text features (e.g., using Bidirectional Encoder Representations from Transformers (BERT)), respectively. These are fed into a feature enhancer composed of multiple layers, each applying deformable self-attention to the image features and vanilla self-attention to the text features, together with image-to-text and text-to-image cross-attention.
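
For illustration, a single feature enhancer layer could be sketched as below; standard multi-head attention stands in for the deformable self-attention applied to the image branch, and the class and layer names are hypothetical.

```python
import torch
import torch.nn as nn

class FeatureEnhancerLayer(nn.Module):
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.img_self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.txt_self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.img2txt_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.txt2img_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, img, txt):
        # Intra-modality self-attention (deformable for images in the paper).
        img = img + self.img_self_attn(img, img, img)[0]
        txt = txt + self.txt_self_attn(txt, txt, txt)[0]
        # Bidirectional cross-modality attention to fuse the two token sequences.
        img = img + self.img2txt_attn(img, txt, txt)[0]   # image-to-text
        txt = txt + self.txt2img_attn(txt, img, img)[0]   # text-to-image
        return img, txt

img, txt = torch.randn(1, 1200, 256), torch.randn(1, 32, 256)
img, txt = FeatureEnhancerLayer()(img, txt)
```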

The language-guided query selection module selects features relevant to the input text as decoder queries. The query selection process outputs a set of indices, which are used to extract features for initializing queries. The decoder queries consist of content and positional parts, with the positional part formulated as dynamic anchor boxes.
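
A minimal sketch of language-guided query selection under simplifying assumptions (dot-product similarity, 900 queries as in the paper, and a hypothetical linear head producing the dynamic anchor boxes) might look like:

```python
import torch
import torch.nn as nn

def select_queries(img_feat, txt_feat, anchor_head, num_queries=900):
    # img_feat: (B, N_img, d), txt_feat: (B, N_txt, d)
    sim = img_feat @ txt_feat.transpose(1, 2)            # (B, N_img, N_txt)
    scores = sim.max(dim=-1).values                      # best text match per image token
    idx = scores.topk(num_queries, dim=1).indices        # indices of the selected tokens
    selected = img_feat.gather(1, idx.unsqueeze(-1).expand(-1, -1, img_feat.size(-1)))
    anchors = anchor_head(selected).sigmoid()            # positional part: dynamic anchor boxes
    return idx, selected, anchors

anchor_head = nn.Linear(256, 4)
idx, selected, anchors = select_queries(torch.randn(2, 1200, 256),
                                        torch.randn(2, 32, 256), anchor_head)
print(idx.shape, anchors.shape)   # (2, 900), (2, 900, 4)
```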

The cross-modality decoder combines image and text modality features. Each cross-modality query is processed through a self-attention layer, an image cross-attention layer, a text cross-attention layer, and a feed-forward network (FFN) layer in each cross-modality decoder layer.
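
One cross-modality decoder layer could be sketched as follows; standard attention again stands in for the deformable image cross-attention, and normalization is omitted for brevity.

```python
import torch
import torch.nn as nn

class CrossModalityDecoderLayer(nn.Module):
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.img_cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.txt_cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 1024), nn.ReLU(),
                                 nn.Linear(1024, d_model))

    def forward(self, queries, img_feat, txt_feat):
        queries = queries + self.self_attn(queries, queries, queries)[0]
        queries = queries + self.img_cross_attn(queries, img_feat, img_feat)[0]
        queries = queries + self.txt_cross_attn(queries, txt_feat, txt_feat)[0]
        queries = queries + self.ffn(queries)
        return queries

q = torch.randn(1, 900, 256)
q = CrossModalityDecoderLayer()(q, torch.randn(1, 1200, 256), torch.randn(1, 32, 256))
```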

The paper discusses two existing kinds of text prompts, sentence-level and word-level representations, and introduces a sub-sentence level representation that avoids unwanted word interactions by using attention masks to block attention among unrelated category names.
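
A minimal sketch of such a sub-sentence attention mask, assuming tokens are already grouped by the category name they belong to:

```python
import torch

def subsentence_mask(group_ids):
    # group_ids: (N_txt,) category index per token, e.g. "cat . dog" -> [0, 0, 1]
    # True = attention allowed (both tokens belong to the same category name).
    return group_ids.unsqueeze(0) == group_ids.unsqueeze(1)

ids = torch.tensor([0, 0, 1, 1, 2])          # tokens of three category names
mask = subsentence_mask(ids)
print(mask.int())
# Note: nn.MultiheadAttention's attn_mask uses the opposite convention
# (True means "blocked"), so the mask would be passed as ~mask there.
```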

The loss function combines the L1 loss and the Generalized Intersection over Union (GIoU) loss for bounding box regression. Following GLIP, a contrastive loss between predicted objects and language tokens is used for classification. Box regression and classification costs drive the bipartite matching between predictions and ground truths, and the final losses are computed between ground truths and their matched predictions with the same loss components. Auxiliary losses are added after each decoder layer and after the encoder outputs.
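
A sketch of the matching step under this description, combining an L1 cost, a GIoU cost, and a GLIP-style token-alignment classification cost; the cost weights and the construction of the ground-truth phrase masks are illustrative assumptions, not the paper's exact values.

```python
import torch
from torchvision.ops import generalized_box_iou
from scipy.optimize import linear_sum_assignment

def match(pred_boxes, pred_logits, gt_boxes, gt_token_masks,
          w_cls=1.0, w_l1=5.0, w_giou=2.0):
    # pred_boxes: (Q, 4) xyxy, pred_logits: (Q, N_txt) object-to-token alignment scores
    # gt_boxes: (G, 4) xyxy, gt_token_masks: (G, N_txt) binary masks of the phrase tokens
    cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)                    # (Q, G)
    cost_giou = -generalized_box_iou(pred_boxes, gt_boxes)              # (Q, G)
    # Classification cost: negative mean alignment probability on the phrase tokens.
    probs = pred_logits.sigmoid()
    cost_cls = -(probs @ gt_token_masks.t()) / gt_token_masks.sum(-1)   # (Q, G)
    cost = w_cls * cost_cls + w_l1 * cost_l1 + w_giou * cost_giou
    return linear_sum_assignment(cost.detach().numpy())                 # Hungarian matching

# Toy example: 900 predictions, 2 ground-truth boxes, 32 text tokens.
xy = torch.rand(900, 2)
pred_boxes = torch.cat([xy, xy + torch.rand(900, 2) * 0.3], dim=-1)
gt_boxes = torch.tensor([[0.1, 0.1, 0.4, 0.5], [0.3, 0.2, 0.9, 0.8]])
gt_token_masks = torch.zeros(2, 32)
gt_token_masks[0, 1], gt_token_masks[1, 4] = 1.0, 1.0
q_idx, g_idx = match(pred_boxes, torch.randn(900, 32), gt_boxes, gt_token_masks)
print(q_idx, g_idx)
```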

Experiments

The paper presents experiments on three settings:

  • a closed-set setting on the COCO detection benchmark
  • an open-set setting on zero-shot COCO, LVIS, and ODinW
  • a referring detection setting on RefCOCO/+/g

Ablation studies validate the model design. The models were pre-trained on the Objects365 (O365) dataset with a Swin-T backbone. Results indicate that each fusion component contributes to the final performance, with encoder fusion being the most impactful. The sub-sentence level representation is also beneficial.

Zero-Shot Transfer

On the COCO benchmark, Grounding DINO outperforms previous models in the zero-shot transfer setting, achieving improvements of +0.5 AP and +1.8 AP compared to DINO and GLIP, respectively. Grounding DINO attains 62.6 AP on COCO minival and 63.0 AP on COCO test-dev with fine-tuning on the COCO dataset.

On the LVIS benchmark, Grounding DINO outperforms GLIP under the same settings, performing better on common objects but worse on rare categories, potentially because the fixed budget of 900 queries limits its capacity for long-tailed categories. Grounding DINO also exhibits larger gains than GLIP as more data is added, suggesting better scalability.

On the ODinW benchmark, Grounding DINO performs well, with Grounding-DINO-T outperforming DINO on few-shot and full-shot settings. Grounding DINO with a Swin-T backbone even surpasses DINO with Swin-L on the full-shot setting. Grounding-DINO-L sets a new record on ODinW zero-shot with a 26.1 AP.

Referring Object Detection

On the REC task, Grounding DINO outperforms GLIP under the same setting. However, both models require REC data to perform well.

Transfer from DINO

Experiments on transferring a pre-trained DINO to Grounding DINO show that freezing the modules shared by DINO and Grounding DINO and fine-tuning only the remaining parameters achieves comparable performance.
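
As a rough illustration of this transfer strategy (with toy stand-in modules rather than the actual models), the shared weights can be loaded and frozen while the newly added, language-related parameters remain trainable:

```python
import torch.nn as nn

# Stand-ins: "dino" and "grounding_dino" share a backbone and box head, and the
# grounded model adds a text-related projection.
dino = nn.ModuleDict({"backbone": nn.Linear(256, 256),
                      "bbox_head": nn.Linear(256, 4)})
grounding_dino = nn.ModuleDict({"backbone": nn.Linear(256, 256),
                                "bbox_head": nn.Linear(256, 4),
                                "text_proj": nn.Linear(768, 256)})

# Load the weights that exist in both models.
shared = {k: v for k, v in dino.state_dict().items()
          if k in grounding_dino.state_dict()}
grounding_dino.load_state_dict(shared, strict=False)

# Freeze the shared (DINO) parameters and fine-tune only the new ones.
for name, p in grounding_dino.named_parameters():
    p.requires_grad = name not in shared

print([n for n, p in grounding_dino.named_parameters() if p.requires_grad])
```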

Conclusion

The paper concludes by highlighting Grounding DINO's ability to detect arbitrary objects given text queries. It emphasizes the tight fusion approach for better cross-modality information fusion and the sub-sentence level representation for formulating text prompts from detection data. The extension of open-set object detection to REC tasks, and the corresponding evaluation results, indicate that existing open-set detectors may require fine-tuning on REC data to perform well.

Authors
  1. Shilong Liu
  2. Zhaoyang Zeng
  3. Tianhe Ren
  4. Feng Li
  5. Hao Zhang
  6. Jie Yang
  7. Chunyuan Li
  8. Jianwei Yang
  9. Hang Su
  10. Jun Zhu
  11. Lei Zhang
  12. Qing Jiang