Hyperbolic Learning with Synthetic Captions for Open-World Detection (2404.05016v1)
Abstract: Open-world detection poses significant challenges, as it requires detecting any object using either object class labels or free-form text. Existing related works often rely on large-scale manually annotated caption datasets for training, which are extremely expensive to collect. Instead, we propose to transfer knowledge from vision-language models (VLMs) to enrich the open-vocabulary descriptions automatically. Specifically, we bootstrap dense synthetic captions using pre-trained VLMs to provide rich descriptions of different regions in images, and incorporate these captions to train a novel detector that generalizes to novel concepts. To mitigate the noise caused by hallucination in synthetic captions, we also propose a novel hyperbolic vision-language learning approach that imposes a hierarchy between visual and caption embeddings. We call our detector "HyperLearner". We conduct extensive experiments on a wide variety of open-world detection benchmarks (COCO, LVIS, Object Detection in the Wild, RefCOCO), and our results show that our model consistently outperforms existing state-of-the-art methods, such as GLIP, GLIPv2 and Grounding DINO, when using the same backbone.
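To make the "hierarchy between visual and caption embeddings" concrete, below is a minimal sketch (not the authors' released code) of hyperbolic vision-language alignment on a unit-curvature Poincaré ball. It assumes pre-extracted region and caption features; the function names (`expmap0`, `poincare_dist`, `hierarchy_loss`) and the specific norm-ordering regularizer are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: hyperbolic alignment of region/caption features on the
# Poincare ball (curvature -1). Function names and loss form are assumptions.
import torch
import torch.nn.functional as F

def expmap0(v, eps=1e-6):
    """Map Euclidean feature vectors onto the Poincare ball via the exponential
    map at the origin; outputs have norm tanh(||v||) < 1."""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(norm) * v / norm

def poincare_dist(x, y, eps=1e-6):
    """Geodesic distance between two points on the unit-curvature Poincare ball."""
    sq = (x - y).pow(2).sum(-1)
    denom = (1 - x.pow(2).sum(-1)).clamp_min(eps) * (1 - y.pow(2).sum(-1)).clamp_min(eps)
    return torch.acosh(1 + 2 * sq / denom + eps)

def hierarchy_loss(region_feats, caption_feats, margin=0.1):
    """Illustrative hierarchy regularizer: matched region/caption pairs are pulled
    together, and caption embeddings (the more generic concept) are pushed closer
    to the origin than the regions they describe, so text sits 'above' vision in
    the induced hierarchy. This is an assumed, simplified formulation."""
    r = expmap0(region_feats)
    c = expmap0(caption_feats)
    align = poincare_dist(r, c).mean()               # pull matched pairs together
    order = F.relu(c.norm(dim=-1) - r.norm(dim=-1) + margin).mean()  # captions nearer origin
    return align + order

# Toy usage with random 256-d features for 8 region-caption pairs.
regions = torch.randn(8, 256) * 0.1
captions = torch.randn(8, 256) * 0.1
print(hierarchy_loss(regions, captions))
```

The intuition behind placing captions nearer the origin is that hyperbolic space embeds trees with low distortion: points near the origin act as ancestors of points near the boundary, so noisy or generic synthetic captions can subsume many concrete image regions without forcing exact feature agreement.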
- End-to-end object detection with transformers. In ECCV, 2020.
- Low-dimensional hyperbolic knowledge graph embeddings. arXiv preprint arXiv:2005.00545, 2020.
- Scaledet: A scalable multi-dataset object detector. In CVPR, 2023.
- Open-vocabulary object detection using pseudo caption labels. arXiv preprint arXiv:2303.13040, 2023.
- Apo-vae: Text generation in hyperbolic space. arXiv preprint arXiv:2005.00054, 2020.
- Dynamic head: Unifying object detection heads with attentions. In CVPR, 2021.
- Virtex: Learning visual representations from textual annotations. In CVPR, 2021.
- Hyperbolic image-text representations. In ICML, 2023.
- Embedding text in hyperbolic spaces. arXiv preprint arXiv:1806.04313, 2018.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Dense and aligned captions (dac) promote compositional reasoning in vl models. In NeurIPS, 2023.
- Hyperbolic entailment cones for learning hierarchical embeddings. In ICML, 2018.
- Hyperbolic contrastive learning for visual representations beyond objects. In CVPR, 2023.
- Ross Girshick. Fast r-cnn. In ICCV, 2015.
- Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2021.
- Clipped hyperbolic classifiers are super-hyperbolic classifiers. In CVPR, 2022.
- Lvis: A dataset for large vocabulary instance segmentation. In CVPR, 2019.
- Mask r-cnn. In ICCV, 2017.
- Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
- MDETR: Modulated detection for end-to-end multi-modal understanding. In ICCV, 2021.
- Referitgame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
- Hyperbolic image embeddings. In CVPR, 2020.
- Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017.
- Findit: Generalized localization with natural language queries. In ECCV, 2022.
- Lorentzian distance learning for hyperbolic representations. In ICML, 2019.
- Inferring concept hierarchies from text corpora via hyperbolic embeddings. arXiv preprint arXiv:1902.00913, 2019.
- Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In AAAI, 2020.
- Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
- Grounded language-image pre-training. In CVPR, 2022.
- Referring transformer: A one-step approach to multi-task visual grounding. In NeurIPS, 2021.
- Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
- Microsoft coco: Common objects in context. In ECCV, 2014.
- Dq-detr: Dual query detection transformer for phrase extraction and grounding. In AAAI, 2023.
- Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
- Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
- Differentiating through the fréchet mean. In ICML, 2020.
- Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.
- Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
- George A. Miller. Wordnet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
- Learning continuous hierarchies in the lorentz model of hyperbolic geometry. In ICML, 2018.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
- Imagenet-21k pretraining for the masses. arXiv preprint arXiv:2104.10972, 2021.
- Object hallucination in image captioning. arXiv preprint arXiv:1809.02156, 2018.
- Objects365: A large-scale, high-quality dataset for object detection. In ICCV, 2019.
- Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
- Fcos: Fully convolutional one-stage object detection. In ICCV, 2019.
- Image captioners are scalable vision learners too. arXiv preprint arXiv:2306.07915, 2023.
- Attention is all you need. In NeurIPS, 2017.
- Aligning bag of regions for open-vocabulary object detection. In CVPR, 2023.
- Cora: Adapting clip for open-vocabulary detection with region prompting and anchor pre-matching. In CVPR, 2023.
- Unified contrastive learning in image-text-label space. In CVPR, 2022.
- Alip: Adaptive language-image pre-training with synthetic caption. In ICCV, 2023.
- Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection. In NeurIPS, 2022.
- Filip: Fine-grained interactive language-image pre-training. In ICLR, 2021.
- Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
- Mattnet: Modular attention network for referring expression comprehension. In CVPR, 2018.
- Modeling context in referring expressions. In ECCV, 2016.
- Open-vocabulary detr with conditional matching. In ECCV, 2022.
- Open-vocabulary object detection using captions. In CVPR, 2021.
- Lit: Zero-shot transfer with locked-image text tuning. In CVPR, 2022.
- Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022.
- Glipv2: Unifying localization and vision-language understanding. In NeurIPS, 2022.
- Regionclip: Region-based language-image pretraining. In CVPR, 2022.
- Detecting twenty-thousand classes using image-level supervision. In ECCV, 2022.
- Probabilistic two-stage detection. arXiv preprint arXiv:2103.07461, 2021.
- Objects as points. arXiv preprint arXiv:1904.07850, 2019.
- Deformable detr: Deformable transformers for end-to-end object detection. In ICLR, 2020.
- Fanjie Kong
- Yanbei Chen
- Jiarui Cai
- Davide Modolo