- The paper introduces DINO-X, which uses a Transformer encoder-decoder and pre-training on over 100M samples to enhance open-world object detection performance.
- The model employs diverse prompting techniques—including text, visual, and customized prompts—to enable prompt-free detection even in long-tailed scenarios.
- Empirical results show DINO-X Pro achieving zero-shot AP scores of 56.0 on COCO, 59.8 on LVIS-minival, and 52.4 on LVIS-val, demonstrating strong multi-task vision capabilities.
Overview of DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding
The paper introduces DINO-X, a comprehensive object-centric vision model engineered for open-world object detection and understanding. Developed by the IDEA Research Team, DINO-X claims to advance the state of the art in open-world object detection performance. It adopts the same Transformer-based encoder-decoder architecture as Grounding DINO 1.5 to support multi-task, object-level understanding in open-world contexts.
DINO-X is distinctive in its support for diverse prompting techniques, including text, visual, and customized prompts. Through these mechanisms, the model aims to overcome the limitations of earlier models that require a user-supplied prompt at inference time, and it is particularly well suited to long-tailed object detection scenarios. Its ability to perform prompt-free detection is one of its salient features, allowing the model to identify any object in an image without user-provided prompts.
Moreover, DINO-X is built on extensive pre-training using the Grounding-100M dataset, composed of over 100 million high-quality grounding samples. This large-scale pre-training endows the model with foundational object-level representations, allowing it to incorporate multiple perception heads for tasks such as detection, segmentation, pose estimation, and object captioning.
Model Variants and Performance
DINO-X is available in two configurations: DINO-X Pro, designed for enhanced perception capabilities across various scenarios, and DINO-X Edge, optimized for faster inference on edge devices. Utilizing a ViT-based backbone, DINO-X integrates additional perception heads into a unified framework, facilitating outputs from bounding boxes to fine-grained masks and captions.
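The unified-framework idea can be made concrete with a small sketch. The following PyTorch snippet is illustrative only, not the released implementation: all class names, layer sizes, and the query count are assumptions, and the convolutional "backbone" merely stands in for the actual ViT. The point it shows is how a shared backbone and decoder produce object queries that several lightweight task heads read in parallel.

```python
import torch
import torch.nn as nn


class UnifiedPerceptionModel(nn.Module):
    """Illustrative sketch: shared features and object queries feed
    separate heads for detection, segmentation, pose, and captioning."""

    def __init__(self, dim=256, num_queries=900, vocab=30522):
        super().__init__()
        # Stand-in for the ViT backbone: a patchify convolution.
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.queries = nn.Embedding(num_queries, dim)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True),
            num_layers=2)
        # One head per object-level task, all reading the same queries.
        self.box_head = nn.Linear(dim, 4)          # detection: (cx, cy, w, h)
        self.mask_head = nn.Linear(dim, 32 * 32)   # segmentation: coarse mask logits
        self.pose_head = nn.Linear(dim, 17 * 2)    # pose: 17 COCO-style keypoints
        self.caption_head = nn.Linear(dim, vocab)  # captioning: per-query token logits

    def forward(self, images):
        feats = self.backbone(images).flatten(2).transpose(1, 2)   # (B, HW, dim)
        q = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        q = self.decoder(q, feats)                                  # (B, queries, dim)
        return {
            "boxes": self.box_head(q).sigmoid(),
            "masks": self.mask_head(q),
            "keypoints": self.pose_head(q),
            "caption_logits": self.caption_head(q),
        }


out = UnifiedPerceptionModel()(torch.randn(1, 3, 224, 224))
```

Because every head decodes the same object queries, adding a new perception task amounts to attaching another small head rather than training a separate model.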
Empirical evaluations support these claims. DINO-X Pro achieves notable AP scores across zero-shot benchmarks, attaining 56.0 AP on COCO, 59.8 AP on LVIS-minival, and 52.4 AP on LVIS-val. On the rare classes of these datasets in particular, it surpasses prior results, underscoring its improved ability to recognize long-tailed objects.
Technical Contributions
A key contribution of DINO-X is its employment of diverse prompt types. Text prompts generally address common detection scenarios, while visual prompts extend detection capabilities in scenarios where textual descriptions are insufficient. Customized prompts, which can be pre-defined or user-tuned for domain-specific applications, enhance the model's adaptability.
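The relationship between the prompt types can be sketched as a simple request object. Everything below is hypothetical (the paper does not specify a client API): the class, field names, and the prompt format in the comments are assumptions meant only to show how text, visual, and customized prompts, plus the prompt-free fallback, are mutually arranged.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DetectionRequest:
    """Hypothetical request illustrating DINO-X's prompt modes."""
    image: str                                          # path to the input image
    text_prompt: Optional[str] = None                   # e.g. "person . bicycle . dog"
    visual_prompt: Optional[List[Tuple]] = None         # example boxes marking the target
    custom_prompt_id: Optional[str] = None              # handle to a pre-tuned domain embedding

    def mode(self) -> str:
        """Resolve which prompting mode the request uses."""
        if self.text_prompt is not None:
            return "text"
        if self.visual_prompt is not None:
            return "visual"
        if self.custom_prompt_id is not None:
            return "custom"
        # No prompt at all: detect every object in the image.
        return "prompt-free"


# Prompt-free detection requires nothing beyond the image itself.
req = DetectionRequest(image="street.jpg")
print(req.mode())  # "prompt-free"
```

The prompt-free branch is the distinctive case: whereas text and visual prompts narrow the search to named or exemplified objects, omitting all prompts asks the model to enumerate everything it can detect.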
The establishment of the Grounding-100M dataset constitutes another critical contribution, serving as the basis for pre-training aimed at reinforcing the model's grounding performance. The integration of different perception heads enables DINO-X to simultaneously handle multiple vision tasks, offering more comprehensive object-level image understanding.
Implications and Future Directions
The development of DINO-X has significant implications for practical applications, such as aiding robots in dynamic environments, enhancing the sensory inputs of autonomous vehicles, and reducing hallucinations in multimodal large language models. Its prompt-free detection feature delivers practical utility without cumbersome user inputs.
Looking forward, potential developments could focus on optimizing task-specific heads, enhancing segmentation accuracy, and expanding capabilities for broader perception tasks. Furthermore, exploration of efficient deployment strategies for the Edge variant could enable its application in real-world edge scenarios requiring rapid inference.
Conclusion
DINO-X presents a robust advancement in open-world object detection, integrating flexible prompt mechanisms and a large-scale grounding dataset for foundational model training. The system demonstrates enhanced zero-shot performance and long-tailed object recognition, positioning it as a valuable tool for both theoretical exploration and practical application in AI-driven perception and understanding tasks.