Hyperbolic Learning with Synthetic Captions for Open-World Detection (2404.05016v1)
Abstract: Open-world detection poses significant challenges, as it requires detecting any object using either object class labels or free-form text. Existing related works often rely on large-scale manually annotated caption datasets for training, which are extremely expensive to collect. Instead, we propose to transfer knowledge from vision-language models (VLMs) to enrich the open-vocabulary descriptions automatically. Specifically, we bootstrap dense synthetic captions using pre-trained VLMs to provide rich descriptions of different regions in images, and incorporate these captions to train a novel detector that generalizes to novel concepts. To mitigate the noise caused by hallucination in synthetic captions, we also propose a novel hyperbolic vision-language learning approach that imposes a hierarchy between visual and caption embeddings. We call our detector "HyperLearner". We conduct extensive experiments on a wide variety of open-world detection benchmarks (COCO, LVIS, Object Detection in the Wild, RefCOCO), and our results show that our model consistently outperforms existing state-of-the-art methods, such as GLIP, GLIPv2 and Grounding DINO, when using the same backbone.
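To make the "hierarchy between visual and caption embeddings" concrete, below is a minimal sketch (not the authors' released code) of hyperbolic vision-language alignment on a unit-curvature Poincaré ball. It assumes pre-extracted region and caption features; the function names (`expmap0`, `poincare_dist`, `hierarchy_loss`) and the specific norm-ordering regularizer are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: hyperbolic alignment of region/caption features on the
# Poincare ball (curvature -1). Function names and loss form are assumptions.
import torch
import torch.nn.functional as F

def expmap0(v, eps=1e-6):
    """Map Euclidean feature vectors onto the Poincare ball via the exponential
    map at the origin; outputs have norm tanh(||v||) < 1."""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(norm) * v / norm

def poincare_dist(x, y, eps=1e-6):
    """Geodesic distance between two points on the unit-curvature Poincare ball."""
    sq = (x - y).pow(2).sum(-1)
    denom = (1 - x.pow(2).sum(-1)).clamp_min(eps) * (1 - y.pow(2).sum(-1)).clamp_min(eps)
    return torch.acosh(1 + 2 * sq / denom + eps)

def hierarchy_loss(region_feats, caption_feats, margin=0.1):
    """Illustrative hierarchy regularizer: matched region/caption pairs are pulled
    together, and caption embeddings (the more generic concept) are pushed closer
    to the origin than the regions they describe, so text sits 'above' vision in
    the induced hierarchy. This is an assumed, simplified formulation."""
    r = expmap0(region_feats)
    c = expmap0(caption_feats)
    align = poincare_dist(r, c).mean()               # pull matched pairs together
    order = F.relu(c.norm(dim=-1) - r.norm(dim=-1) + margin).mean()  # captions nearer origin
    return align + order

# Toy usage with random 256-d features for 8 region-caption pairs.
regions = torch.randn(8, 256) * 0.1
captions = torch.randn(8, 256) * 0.1
print(hierarchy_loss(regions, captions))
```

The intuition behind placing captions nearer the origin is that hyperbolic space embeds trees with low distortion: points near the origin act as ancestors of points near the boundary, so noisy or generic synthetic captions can subsume many concrete image regions without forcing exact feature agreement.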
- End-to-end object detection with transformers. In ECCV, 2020.
- Low-dimensional hyperbolic knowledge graph embeddings. arXiv preprint arXiv:2005.00545, 2020.
- Scaledet: A scalable multi-dataset object detector. In CVPR, 2023.
- Open-vocabulary object detection using pseudo caption labels. arXiv preprint arXiv:2303.13040, 2023.
- Apo-vae: Text generation in hyperbolic space. arXiv preprint arXiv:2005.00054, 2020.
- Dynamic head: Unifying object detection heads with attentions. In CVPR, 2021.
- Virtex: Learning visual representations from textual annotations. In CVPR, 2021.
- Hyperbolic image-text representations. In ICML, 2023.
- Embedding text in hyperbolic spaces. arXiv preprint arXiv:1806.04313, 2018.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Dense and aligned captions (dac) promote compositional reasoning in vl models. In NeurIPS, 2023.
- Hyperbolic entailment cones for learning hierarchical embeddings. In ICML, 2018.
- Hyperbolic contrastive learning for visual representations beyond objects. In CVPR, 2023.
- Ross Girshick. Fast r-cnn. In ICCV, 2015.
- Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2021.
- Clipped hyperbolic classifiers are super-hyperbolic classifiers. In CVPR, 2022.
- Lvis: A dataset for large vocabulary instance segmentation. In CVPR, 2019.
- Mask r-cnn. In ICCV, 2017.
- Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
- MDETR: Modulated detection for end-to-end multi-modal understanding. In ICCV, 2021.
- Referitgame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
- Hyperbolic image embeddings. In CVPR, 2020.
- Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017.
- Findit: Generalized localization with natural language queries. In ECCV, 2022.
- Lorentzian distance learning for hyperbolic representations. In ICML, 2019.
- Inferring concept hierarchies from text corpora via hyperbolic embeddings. arXiv preprint arXiv:1902.00913, 2019.
- Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In AAAI, 2020.
- Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
- Grounded language-image pre-training. In CVPR, 2022.
- Referring transformer: A one-step approach to multi-task visual grounding. In NeurIPS, 2021.
- Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
- Microsoft coco: Common objects in context. In ECCV, 2014.
- Dq-detr: Dual query detection transformer for phrase extraction and grounding. In AAAI, 2023.
- Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
- Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
- Differentiating through the fréchet mean. In ICML, 2020.
- Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.
- Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
- George A. Miller. Wordnet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
- Learning continuous hierarchies in the lorentz model of hyperbolic geometry. In ICML, 2018.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
- Imagenet-21k pretraining for the masses. arXiv preprint arXiv:2104.10972, 2021.
- Object hallucination in image captioning. arXiv preprint arXiv:1809.02156, 2018.
- Objects365: A large-scale, high-quality dataset for object detection. In ICCV, 2019.
- Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
- Fcos: Fully convolutional one-stage object detection. In ICCV, 2019.
- Image captioners are scalable vision learners too. arXiv preprint arXiv:2306.07915, 2023.
- Attention is all you need. In NeurIPS, 2017.
- Aligning bag of regions for open-vocabulary object detection. In CVPR, 2023.
- Cora: Adapting clip for open-vocabulary detection with region prompting and anchor pre-matching. In CVPR, 2023.
- Unified contrastive learning in image-text-label space. In CVPR, 2022.
- Alip: Adaptive language-image pre-training with synthetic caption. In ICCV, 2023.
- Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection. In NeurIPS, 2022.
- Filip: Fine-grained interactive language-image pre-training. In ICLR, 2021.
- Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
- Mattnet: Modular attention network for referring expression comprehension. In CVPR, 2018.
- Modeling context in referring expressions. In ECCV, 2016.
- Open-vocabulary detr with conditional matching. In ECCV, 2022.
- Open-vocabulary object detection using captions. In CVPR, 2021.
- Lit: Zero-shot transfer with locked-image text tuning. In CVPR, 2022.
- Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022.
- Glipv2: Unifying localization and vision-language understanding. In NeurIPS, 2022.
- Regionclip: Region-based language-image pretraining. In CVPR, 2022.
- Detecting twenty-thousand classes using image-level supervision. In ECCV, 2022.
- Probabilistic two-stage detection. arXiv preprint arXiv:2103.07461, 2021.
- Objects as points. arXiv preprint arXiv:1904.07850, 2019.
- Deformable detr: Deformable transformers for end-to-end object detection. In ICLR, 2020.
- Fanjie Kong
- Yanbei Chen
- Jiarui Cai
- Davide Modolo