Hyperbolic Learning with Synthetic Captions for Open-World Detection (2404.05016v1)

Published 7 Apr 2024 in cs.CV

Abstract: Open-world detection poses significant challenges, as it requires detecting any object using either object class labels or free-form text. Existing related works often use large-scale manually annotated caption datasets for training, which are extremely expensive to collect. Instead, we propose to transfer knowledge from vision-language models (VLMs) to enrich open-vocabulary descriptions automatically. Specifically, we bootstrap dense synthetic captions using pre-trained VLMs to provide rich descriptions of different regions in images, and we incorporate these captions to train a novel detector that generalizes to novel concepts. To mitigate the noise caused by hallucination in synthetic captions, we also propose a novel hyperbolic vision-language learning approach that imposes a hierarchy between visual and caption embeddings. We call our detector ``HyperLearner''. We conduct extensive experiments on a wide variety of open-world detection benchmarks (COCO, LVIS, Object Detection in the Wild, RefCOCO), and our results show that our model consistently outperforms existing state-of-the-art methods, such as GLIP, GLIPv2 and Grounding DINO, when using the same backbone.
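The hyperbolic vision-language idea in the abstract can be illustrated with a small sketch. The code below is not the paper's implementation; it only shows, using a standard Poincare-ball geodesic distance and a generic margin loss, how matched region/caption embedding pairs could be pulled together and mismatched pairs pushed apart in hyperbolic space. The function names, margin, and embedding dimensions are illustrative assumptions.

    # Minimal sketch (assumed names/values), not the paper's exact loss.
    import torch

    def poincare_distance(x, y, eps=1e-5):
        """Geodesic distance in the Poincare ball between batches of points."""
        x2 = torch.clamp(torch.sum(x * x, dim=-1), max=1 - eps)
        y2 = torch.clamp(torch.sum(y * y, dim=-1), max=1 - eps)
        diff2 = torch.sum((x - y) ** 2, dim=-1)
        arg = 1 + 2 * diff2 / ((1 - x2) * (1 - y2))
        return torch.acosh(torch.clamp(arg, min=1 + eps))

    def hyperbolic_contrastive_loss(region_emb, caption_emb, margin=0.2):
        """Pull matched region/caption pairs together, push mismatched pairs apart."""
        n = region_emb.shape[0]
        # pairwise distances between every region and every caption
        d = poincare_distance(region_emb.unsqueeze(1), caption_emb.unsqueeze(0))
        pos = torch.diagonal(d)                       # matched pairs
        neg_mask = ~torch.eye(n, dtype=torch.bool)
        neg = d[neg_mask].view(n, n - 1)              # mismatched pairs
        return torch.clamp(margin + pos.unsqueeze(1) - neg, min=0).mean()

    # toy usage: small-norm embeddings so points lie inside the unit ball
    regions = torch.randn(4, 64) * 0.1
    captions = torch.randn(4, 64) * 0.1
    loss = hyperbolic_contrastive_loss(regions, captions)

In the paper itself, the captions come from dense synthetic descriptions generated by pre-trained VLMs, and the hyperbolic objective is used to encode a hierarchy between visual and caption embeddings that dampens caption hallucination noise; the exact formulation is given in the paper rather than in this sketch.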

References (68)
  1. End-to-end object detection with transformers. In ECCV, 2020.
  2. Low-dimensional hyperbolic knowledge graph embeddings. arXiv preprint arXiv:2005.00545, 2020.
  3. Scaledet: A scalable multi-dataset object detector. In CVPR, 2023.
  4. Open-vocabulary object detection using pseudo caption labels. arXiv preprint arXiv:2303.13040, 2023.
  5. Apo-vae: Text generation in hyperbolic space. arXiv preprint arXiv:2005.00054, 2020.
  6. Dynamic head: Unifying object detection heads with attentions. In CVPR, 2021.
  7. Virtex: Learning visual representations from textual annotations. In CVPR, 2021.
  8. Hyperbolic image-text representations. In ICML, 2023.
  9. Embedding text in hyperbolic spaces. arXiv preprint arXiv:1806.04313, 2018.
  10. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  11. Dense and aligned captions (dac) promote compositional reasoning in vl models. In NeurIPS, 2023.
  12. Hyperbolic entailment cones for learning hierarchical embeddings. In ICML, 2018.
  13. Hyperbolic contrastive learning for visual representations beyond objects. In CVPR, 2023.
  14. Ross Girshick. Fast r-cnn. In ICCV, 2015.
  15. Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2021.
  16. Clipped hyperbolic classifiers are super-hyperbolic classifiers. In CVPR, 2022.
  17. Lvis: A dataset for large vocabulary instance segmentation. In CVPR, 2019.
  18. Mask r-cnn. In ICCV, 2017.
  19. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
  20. Mdetr-modulated detection for end-to-end multi-modal understanding. In ICCV, 2021.
  21. Referitgame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
  22. Hyperbolic image embeddings. In CVPR, 2020.
  23. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017.
  24. Findit: Generalized localization with natural language queries. In ECCV, 2022.
  25. Lorentzian distance learning for hyperbolic representations. In ICML, 2019.
  26. Inferring concept hierarchies from text corpora via hyperbolic embeddings. arXiv preprint arXiv:1902.00913, 2019.
  27. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In AAAI, 2020.
  28. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  29. Grounded language-image pre-training. In CVPR, 2022.
  30. Referring transformer: A one-step approach to multi-task visual grounding. In NeurIPS, 2021.
  31. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
  32. Microsoft coco: Common objects in context. In ECCV, 2014.
  33. Dq-detr: Dual query detection transformer for phrase extraction and grounding. In AAAI, 2023.
  34. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
  35. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
  36. Differentiating through the fréchet mean. In ICML, 2020.
  37. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.
  38. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
  39. George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.
  40. Learning continuous hierarchies in the lorentz model of hyperbolic geometry. In ICML, 2018.
  41. Learning transferable visual models from natural language supervision. In ICML, 2021.
  42. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
  43. Imagenet-21k pretraining for the masses. arXiv preprint arXiv:2104.10972, 2021.
  44. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156, 2018.
  45. Objects365: A large-scale, high-quality dataset for object detection. In ICCV, 2019.
  46. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
  47. Fcos: Fully convolutional one-stage object detection. In ICCV, 2019.
  48. Image captioners are scalable vision learners too. arXiv preprint arXiv:2306.07915, 2023.
  49. Attention is all you need. In NeurIPS, 2017.
  50. Aligning bag of regions for open-vocabulary object detection. In CVPR, 2023.
  51. Cora: Adapting clip for open-vocabulary detection with region prompting and anchor pre-matching. In CVPR, 2023.
  52. Unified contrastive learning in image-text-label space. In CVPR, 2022.
  53. Alip: Adaptive language-image pre-training with synthetic caption. In ICCV, 2023.
  54. Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection. In NeurIPS, 2022.
  55. Filip: Fine-grained interactive language-image pre-training. In ICLR, 2021.
  56. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
  57. Mattnet: Modular attention network for referring expression comprehension. In CVPR, 2018.
  58. Modeling context in referring expressions. In ECCV, 2016.
  59. Open-vocabulary detr with conditional matching. In ECCV, 2022.
  60. Open-vocabulary object detection using captions. In CVPR, 2021.
  61. Lit: Zero-shot transfer with locked-image text tuning. In CVPR, 2022.
  62. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022.
  63. Glipv2: Unifying localization and vision-language understanding. In NeurIPS, 2022.
  64. Regionclip: Region-based language-image pretraining. In CVPR, 2022.
  65. Detecting twenty-thousand classes using image-level supervision. In ECCV, 2022.
  66. Probabilistic two-stage detection. arXiv preprint arXiv:2103.07461, 2021.
  67. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
  68. Deformable detr: Deformable transformers for end-to-end object detection. In ICLR, 2020.
Authors (4)
  1. Fanjie Kong (10 papers)
  2. Yanbei Chen (167 papers)
  3. Jiarui Cai (9 papers)
  4. Davide Modolo (30 papers)