Ferret: Refer and Ground Anything Anywhere at Any Granularity (2310.07704v1)

Published 11 Oct 2023 in cs.CV and cs.CL

Abstract: We introduce Ferret, a new Multimodal LLM (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image. To extract the continuous features of versatile regions, we propose a spatial-aware visual sampler, adept at handling varying sparsity across different shapes. Consequently, Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes. To bolster the desired capability of Ferret, we curate GRIT, a comprehensive refer-and-ground instruction tuning dataset including 1.1M samples that contain rich hierarchical spatial knowledge, with 95K hard negative data to promote model robustness. The resulting model not only achieves superior performance in classical referring and grounding tasks, but also greatly outperforms existing MLLMs in region-based and localization-demanded multimodal chatting. Our evaluations also reveal a significantly improved capability of describing image details and a remarkable alleviation in object hallucination. Code and data will be available at https://github.com/apple/ml-ferret

Ferret: Refer and Ground Anything Anywhere at Any Granularity

Ferret introduces a Multimodal LLM (MLLM) capable of comprehensive spatial understanding and grounding of images. The primary contributions of Ferret revolve around the unique ability to handle diverse region-based inputs and produce detailed, grounded outputs in a unified framework.

Key Contributions

1. Hybrid Region Representation

Ferret employs a novel hybrid region representation that integrates discrete coordinates and continuous features. This allows the model to represent regions of various shapes, such as points, bounding boxes, and free-form shapes. Coordinates are quantized into discrete bins, while continuous features are extracted by the spatial-aware visual sampler, letting the model work with irregular shapes and sparsely covered regions.
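As a rough sketch of this idea (illustrative only, not the paper's exact tokenization; the bin count and the `<region>` placeholder name are assumptions), a referred region can be serialized as quantized coordinate tokens followed by a placeholder that is later replaced by the continuous feature produced by the visual sampler:

```python
NUM_BINS = 1000  # assumed number of discrete coordinate bins


def quantize_box(box, img_w, img_h, num_bins=NUM_BINS):
    """Map a pixel-space box (x1, y1, x2, y2) to discrete coordinate bins."""
    x1, y1, x2, y2 = box
    return (
        int(x1 / img_w * (num_bins - 1)),
        int(y1 / img_h * (num_bins - 1)),
        int(x2 / img_w * (num_bins - 1)),
        int(y2 / img_h * (num_bins - 1)),
    )


def hybrid_region_prompt(name, box, img_w, img_h):
    """Serialize a region as discrete coordinates plus a <region> placeholder
    that the model later swaps for the continuous region feature."""
    x1, y1, x2, y2 = quantize_box(box, img_w, img_h)
    return f"{name} [{x1}, {y1}, {x2}, {y2}] <region>"


# Example: refer to an object via a bounding box in a 640x480 image.
print(hybrid_region_prompt("the cat", (120, 80, 360, 400), 640, 480))
```

Points and free-form shapes follow the same pattern: the discrete part pins down the location coarsely, while the continuous feature carries fine-grained appearance and shape information.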

2. Spatial-Aware Visual Sampler

This module addresses the challenge of processing diverse region shapes. Drawing on techniques from 3D point-cloud learning, it samples points within the referred region, gathers features from their local neighborhoods, and pools them into a compact region feature. This design lets Ferret handle input regions of varying density and shape effectively.
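A minimal sketch of the sample-gather-pool idea, assuming a dense feature map from the image encoder and a rasterized binary mask for the referred region (the function name, shapes, and the single sampling stage are illustrative; Ferret's actual module stacks several such blocks):

```python
import torch


def sample_gather_pool(feat_map, region_mask, num_samples=512, k=8):
    """Illustrative sample-gather-pool step (PointNet++-style), not Ferret's exact module.

    feat_map:    (C, H, W) dense visual features from the image encoder
    region_mask: (H, W) binary mask of the referred region (point/box/free-form rasterized)
    Returns a single (C,) feature summarizing the region.
    """
    C, H, W = feat_map.shape
    ys, xs = torch.nonzero(region_mask, as_tuple=True)   # pixels inside the region
    if ys.numel() == 0:
        return feat_map.mean(dim=(1, 2))                  # degenerate region: global pooling

    # 1) Sample: randomly pick a fixed number of points inside the region.
    idx = torch.randint(0, ys.numel(), (min(num_samples, ys.numel()),))
    pts = torch.stack([xs[idx], ys[idx]], dim=1).float()  # (N, 2) sampled coordinates
    pt_feats = feat_map[:, ys[idx], xs[idx]].t()          # (N, C) features at sampled points

    # 2) Gather: for each sampled point, collect its k nearest sampled neighbors.
    dists = torch.cdist(pts, pts)                         # (N, N) pairwise distances
    knn = dists.topk(k=min(k, pts.shape[0]), largest=False).indices
    neighbor_feats = pt_feats[knn]                        # (N, k, C)

    # 3) Pool: fuse each local neighborhood, then pool over all sampled points.
    local = neighbor_feats.max(dim=1).values              # (N, C) local max-pool
    return local.mean(dim=0)                              # (C,) region feature
```

The key property is that the computation depends only on the set of points inside the region, so a single point, a box, or a scribbled free-form shape are all handled uniformly.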

3. Comprehensive Dataset: GRIT

Ferret is trained on the GRIT dataset containing 1.1M samples. GRIT includes data on object detection, visual grounding, and complex reasoning, ensuring the model's robust performance across various tasks. The dataset is augmented with 95K hard negative samples to mitigate object hallucination and enhance model robustness.
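To make the data format concrete, a refer-and-ground instruction sample could look roughly like the following (the field names and schema are hypothetical, chosen purely for illustration; GRIT's actual format may differ):

```python
# Hypothetical shape of a single refer-and-ground instruction sample.
sample = {
    "image": "coco/train2017/000000123456.jpg",  # assumed image path
    "conversation": [
        {
            "role": "user",
            # Referring input: the region placeholder carries a box (or point / free-form mask).
            "text": "What is the animal <region> doing?",
            "regions": [{"box": [120, 80, 360, 400]}],
        },
        {
            "role": "assistant",
            # Grounded output: mentioned objects are tied to coordinates in the answer.
            "text": "The cat [120, 80, 360, 400] is sleeping on the sofa [40, 260, 620, 470].",
        },
    ],
}
```

Hard negatives in the same spirit would ask about an object or location that does not match the image, with the expected answer rejecting the premise, which is what helps curb hallucination.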

4. Superior Performance in Classical Tasks

Ferret sets a new state of the art on classical referring and grounding tasks. Empirical results show that it outperforms contemporary models in region-based multimodal chatting as well as on standard visual grounding and grounded captioning benchmarks.

Empirical Evaluations

Referring Object Classification

Ferret demonstrates robust performance across different referring formats (point, box, and free-form shape). On the LVIS dataset, for example, it achieves an accuracy of 68.35% for point-based referring, 80.46% for box-based referring, and 70.98% for free-form referring, significantly outperforming existing models such as Shikra and GPT4RoI.

Grounded Image Captioning

On the Flickr30k Entities dataset, Ferret surpasses its predecessors in both caption evaluation metrics (BLEU@4, METEOR, CIDEr, SPICE) and grounding evaluation metrics (F1_all, F1_loc), with a CIDEr score of 76.1 and an F1_loc of 38.03.
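For intuition, a grounding prediction for a mentioned object is typically counted as correct when its box overlaps the annotated box with IoU above 0.5; a toy version of such a grounding F1 (illustrative only, not the benchmark's official protocol) looks like this:

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def grounding_f1(pred, gold, thr=0.5):
    """Toy grounding F1: pred and gold map mentioned object words to boxes.
    A prediction is a true positive when the word matches and IoU > thr."""
    tp = sum(1 for w, b in pred.items() if w in gold and iou(b, gold[w]) > thr)
    precision = tp / max(len(pred), 1)
    recall = tp / max(len(gold), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```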

Visual Grounding

Ferret excels in standard visual grounding benchmarks, including RefCOCO, RefCOCO+, and RefCOCOg, achieving accuracy rates as high as 92.41% in certain tasks, showcasing marked improvements over previously leading approaches such as MDETR and Kosmos-2.

Ferret-Bench for Multimodal Chatting

Ferret is also assessed on the newly proposed Ferret-Bench, which covers tasks that require referring and grounding within conversational contexts. In the Referring Reasoning task, for instance, Ferret substantially outperforms models such as Kosmos-2 and Shikra, with the advantage most pronounced where detailed spatial understanding and context-aware reasoning are required.

Implications and Future Directions

The implications of Ferret's capabilities are multifaceted:

  • Practical Applications: Ferret's advanced spatial understanding and grounding abilities have profound implications for real-world applications such as autonomous navigation, advanced human-computer interaction, and augmented reality.
  • Theoretical Impact: From a theoretical standpoint, Ferret's hybrid representation and spatial-aware sampling provide a new paradigm for integrating continuous and discrete data in MLLMs, potentially guiding future research on multimodal learning.
  • Future Research: Moving forward, there are intriguing avenues to explore, such as enhancing Ferret to output segmentation masks instead of just bounding boxes. This could bridge the gap between coarse localization and fine-grained scene understanding.

Conclusion

Ferret represents a significant advancement in the domain of multimodal LLMs by setting new benchmarks in spatial understanding and grounding tasks. Its innovative hybrid region representation and spatial-aware visual sampler enable it to tackle diverse and complex visual inputs. With its comprehensive training dataset and robust performance across various benchmarks, Ferret not only addresses current limitations but also opens up new possibilities for future research and applications in AI.

References (79)
  1. Cm3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520, 2022.
  2. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022.
  3. Openflamingo, March 2023. URL https://doi.org/10.5281/zenodo.7733589.
  4. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  5. Position-enhanced visual instruction tuning for multimodal large language models. arXiv preprint arXiv:2308.13437, 2023a.
  6. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023b.
  7. Pix2seq: A language modeling framework for object detection. arXiv preprint arXiv:2109.10852, 2021.
  8. A unified sequence interface for vision tasks. arXiv preprint arXiv:2206.07669, 2022a.
  9. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022b.
  10. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023c.
  11. Uniter: Universal image-text representation learning. In ECCV, 2020.
  12. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
  13. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  14. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
  15. Transvg: End-to-end visual grounding with transformers. In ICCV, 2021.
  16. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
  17. Large-scale adversarial training for vision-and-language representation learning. NeurIPS, 2020.
  18. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  5356–5364, 2019.
  19. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp.  2961–2969, 2017.
  20. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  1780–1790, 2021.
  21. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp.  787–798, 2014.
  22. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  23. Generating images with multimodal language models. arXiv preprint arXiv:2305.17216, 2023a.
  24. Grounding language models to images for multimodal generation. arXiv preprint arXiv:2301.13823, 2023b.
  25. Computational generation of referring expressions: A survey. Computational Linguistics, 38(1):173–218, 2012.
  26. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.
  27. Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692, 2023.
  28. Obelisc: An open web-scale filtered dataset of interleaved image-text documents. arXiv preprint arXiv:2306.16527, 2023.
  29. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023a.
  30. Multimodal foundation models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020, 2023b.
  31. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023c.
  32. M³IT: A large-scale dataset towards multi-modal multilingual instruction tuning. arXiv preprint arXiv:2306.04387, 2023d.
  33. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10965–10975, 2022.
  34. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023e.
  35. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp.  740–755. Springer, 2014.
  36. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023a.
  37. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023b.
  38. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  39. Unified-io: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916, 2022.
  40. Comprehension-guided referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  7102–7111, 2017.
  41. Learning to generate grounded visual captions without localization supervision. In ECCV, 2020.
  42. Rethinking network design and local geometry in point cloud: A simple residual mlp framework. arXiv preprint arXiv:2202.07123, 2022.
  43. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  11–20, 2016.
  44. Modeling context between objects for referring expression understanding. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pp.  792–807. Springer, 2016.
  45. OpenAI. GPT-4 technical report. https://arxiv.org/abs/2303.08774, 2023a.
  46. OpenAI. GPT-4 technical report. arXiv, 2023b.
  47. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
  48. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pp.  2641–2649, 2015.
  49. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  652–660, 2017a.
  50. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30, 2017b.
  51. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.  8748–8763. PMLR, 2021.
  52. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
  53. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  8430–8439, 2019.
  54. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023.
  55. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  56. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  57. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022a.
  58. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In ICML, 2022b.
  59. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175, 2023.
  60. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (tog), 38(5):1–12, 2019.
  61. Simvlm: Simple visual language model pretraining with weak supervision. In ICLR, 2022c.
  62. Grit: A generative region-to-text transformer for object understanding. arXiv preprint arXiv:2212.00280, 2022.
  63. Unitab: Unifying text and box outputs for grounded vision-language modeling. In European Conference on Computer Vision, pp.  521–539. Springer, 2022.
  64. The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023.
  65. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
  66. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pp.  69–85. Springer, 2016.
  67. A joint speaker-listener-reinforcer model for referring expressions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  7282–7290, 2017.
  68. Mattnet: Modular attention network for referring expression comprehension. In CVPR, 2018.
  69. Scaling autoregressive multi-modal models: Pretraining and instruction tuning. arXiv preprint arXiv:2309.02591, 2023.
  70. Contextual object detection with multimodal large language models. arXiv preprint arXiv:2305.18279, 2023.
  71. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  6720–6731, 2019.
  72. Glipv2: Unifying localization and vision-language understanding. Advances in Neural Information Processing Systems, 35:36067–36080, 2022.
  73. Gpt4roi: Instruction tuning large language model on region-of-interest. arXiv preprint arXiv:2307.03601, 2023.
  74. Bubogpt: Enabling visual grounding in multi-modal llms. arXiv preprint arXiv:2307.08581, 2023.
  75. Grounded video description. In CVPR, 2019.
  76. More grounded image captioning by distilling image-text matching model. In CVPR, 2020.
  77. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023a.
  78. Multimodal c4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939, 2023b.
  79. Segment everything everywhere all at once. arXiv preprint arXiv:2304.06718, 2023.
Authors (9)
  1. Haoxuan You (33 papers)
  2. Haotian Zhang (107 papers)
  3. Zhe Gan (135 papers)
  4. Xianzhi Du (30 papers)
  5. Bowen Zhang (161 papers)
  6. Zirui Wang (83 papers)
  7. Liangliang Cao (52 papers)
  8. Shih-Fu Chang (131 papers)
  9. Yinfei Yang (73 papers)
Citations (224)

GitHub

  1. GitHub - apple/ml-ferret (8,305 stars)