Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (2404.07973v1)

Published 11 Apr 2024 in cs.CV

Abstract: While Ferret seamlessly integrates regional understanding into the LLM to facilitate its referring and grounding capability, it has certain limitations: it is constrained by the pre-trained fixed visual encoder and fails to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any-resolution grounding and referring: a flexible approach that effortlessly handles higher image resolution, improving the model's ability to process and understand images in greater detail. (2) Multi-granularity visual encoding: by integrating the additional DINOv2 encoder, the model learns better and more diverse underlying contexts for global and fine-grained visual information. (3) A three-stage training paradigm: besides image-caption alignment, an additional stage is proposed for high-resolution dense alignment before the final instruction tuning. Experiments show that Ferret-v2 provides substantial improvements over Ferret and other state-of-the-art methods, thanks to its high-resolution scaling and fine-grained visual processing.

Enhancing Multimodal Understanding with Ferret-v2: A Leap Forward in LLMs

Introduction to Ferret-v2

Ferret-v2 is a substantial evolution of the original Ferret model, marking a significant step forward in integrating visual understanding, including referring and grounding capabilities, into LLMs. Addressing its predecessor's limitations in handling high-resolution images and fine-grained visual processing, Ferret-v2 introduces three pivotal innovations: an any-resolution approach for more detailed image understanding, a multi-granularity visual encoding strategy, and a three-stage training paradigm that aligns both global and local visual semantics with textual inputs. Together, these advances allow Ferret-v2 to surpass previous models on tasks requiring intricate visual comprehension and interaction, as substantiated by extensive experimental validation.

Upgrading Visual Understanding

Any Resolution Processing

Ferret-v2's any-resolution handling mechanism moves beyond traditional fixed-resolution processing. By dividing images into sub-patches and processing them with a flexible CLIP encoder, the model can attend to finer details within images, overcoming the constraints of a predetermined input resolution. Comparative analysis confirms that this strategy outperforms direct upsampling across a range of tasks requiring detailed visual analysis. A minimal sketch of the tiling idea is given below.
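
To make the tiling concrete, here is a minimal, illustrative sketch of splitting an image into fixed-size sub-patches plus a downsampled global view. The tile size, candidate grids, and function names (`TILE`, `pick_grid`, `any_resolution_tiles`) are assumptions for illustration, not the paper's exact configuration.

```python
# A minimal sketch of the "any resolution" idea: pick a grid of fixed-size tiles
# that best matches the image's aspect ratio, resize, and split into sub-patches
# that a fixed-resolution encoder (e.g. CLIP at 336x336) can consume, alongside
# a downsampled global view. Grid choices and tile size are illustrative.
from PIL import Image

TILE = 336                                         # assumed encoder input size
GRIDS = [(1, 2), (2, 1), (2, 2), (1, 3), (3, 1)]   # assumed candidate layouts

def pick_grid(w: int, h: int) -> tuple[int, int]:
    """Choose the (cols, rows) grid whose aspect ratio is closest to the image's."""
    aspect = w / h
    return min(GRIDS, key=lambda g: abs(g[0] / g[1] - aspect))

def any_resolution_tiles(img: Image.Image):
    cols, rows = pick_grid(*img.size)
    # Global view: the whole image downsampled to the encoder resolution.
    global_view = img.resize((TILE, TILE))
    # Local views: resize to the grid's native resolution, then crop tiles.
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows) for c in range(cols)
    ]
    return global_view, tiles

# Each crop can be encoded independently and the resulting features combined,
# so fine detail is not lost to aggressive downsampling of the full image.
```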

Multi-Granularity Visual Encoding

To address the granularity gap between global and local image views, Ferret-v2 uses CLIP and DINOv2 encoders in tandem for distinct kinds of visual content. This dual-encoder strategy combines comprehensive scene understanding with fine-grained detail perception, improving the model's ability to interpret and engage with complex visual inputs; a sketch follows.
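
The sketch below shows one plausible way to wire two encoders at different granularities, each with its own projector into a shared LLM token space. The stand-in encoder stubs, dimensions, and the assignment of global versus local views to particular encoders are assumptions for illustration, not the paper's exact architecture.

```python
# A minimal sketch of multi-granularity encoding: one encoder path for the global
# view and another for high-resolution sub-patches, each followed by its own
# projector into the LLM's embedding space. The linear stubs stand in for
# pretrained ViT backbones (e.g. CLIP and DINOv2); dimensions are placeholders.
import torch
import torch.nn as nn

class MultiGranularityEncoder(nn.Module):
    def __init__(self, global_dim=1024, local_dim=768, llm_dim=4096):
        super().__init__()
        # Stand-ins for pretrained backbones operating on flattened patch pixels.
        self.global_encoder = nn.Linear(3 * 14 * 14, global_dim)
        self.local_encoder = nn.Linear(3 * 14 * 14, local_dim)
        # Separate projectors map each granularity into the LLM token space.
        self.global_proj = nn.Sequential(nn.Linear(global_dim, llm_dim), nn.GELU(),
                                         nn.Linear(llm_dim, llm_dim))
        self.local_proj = nn.Sequential(nn.Linear(local_dim, llm_dim), nn.GELU(),
                                        nn.Linear(llm_dim, llm_dim))

    def forward(self, global_patches, local_patches):
        # global_patches: (B, N_g, 3*14*14); local_patches: (B, N_l, 3*14*14)
        g = self.global_proj(self.global_encoder(global_patches))
        l = self.local_proj(self.local_encoder(local_patches))
        # Concatenate global and local visual tokens before feeding the LLM.
        return torch.cat([g, l], dim=1)

tokens = MultiGranularityEncoder()(torch.randn(1, 576, 588), torch.randn(1, 2304, 588))
print(tokens.shape)  # torch.Size([1, 2880, 4096])
```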

Enhanced Training Paradigm

Ferret-v2's three-stage training paradigm goes beyond simple image-caption alignment. Training starts with image-caption alignment for basic context comprehension, then proceeds to a new stage of high-resolution dense alignment, which strengthens the model's spatial awareness and object recognition. A final instruction-tuning stage refines the model's ability to follow user instructions, yielding a model that handles a wide spectrum of visual and textual tasks. The staging is sketched below.
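
The following sketch shows one way such a staged schedule could be expressed, freezing and unfreezing modules per stage. The module names (`projector`, `visual_sampler`, `llm`), dataset labels, and learning rates are illustrative assumptions rather than the paper's actual recipe.

```python
# A minimal sketch of a three-stage schedule in the spirit described above:
# (1) image-caption alignment, (2) high-resolution dense alignment, and
# (3) instruction tuning. Which modules stay frozen at each stage is assumed.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    data: str                    # dataset family used at this stage (assumed labels)
    trainable: tuple[str, ...]   # parameter-name prefixes left unfrozen
    lr: float

STAGES = [
    Stage("image-caption alignment", "image-text pairs", ("projector",), 1e-3),
    Stage("high-res dense alignment", "dense region/box annotations",
          ("projector", "visual_sampler"), 2e-5),
    Stage("instruction tuning", "instruction-following data",
          ("projector", "visual_sampler", "llm"), 2e-5),
]

def set_trainable(model, prefixes):
    """Freeze everything, then unfreeze parameters whose names match a prefix."""
    for name, p in model.named_parameters():
        p.requires_grad = any(name.startswith(pref) for pref in prefixes)

# Usage sketch: loop over stages, re-freezing modules and rebuilding the
# optimizer with that stage's learning rate before training on its data.
# for stage in STAGES:
#     set_trainable(model, stage.trainable)
#     optimizer = torch.optim.AdamW(
#         (p for p in model.parameters() if p.requires_grad), lr=stage.lr)
#     train_one_stage(model, optimizer, load_dataset(stage.data))
```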

Empirical Validation and Insights

Ferret-v2 was rigorously tested against a suite of benchmarks, including referring and grounding tasks, visual question answering, and modern MLLM benchmarks. The model outperformed existing approaches, not only in fine-grained visual understanding but also in general task performance, evidencing its broad applicability. Ablation studies further isolate the contribution of each proposed component, confirming that any-resolution processing, multi-granularity encoding, and the structured training paradigm each play an integral role in the observed performance gains.

The Route Ahead

Ferret-v2 points the way for future work on multimodal LLMs, suggesting avenues such as even finer-grained visual processing and richer, more diverse training datasets. Its success highlights the prospect of more intuitive, context-aware AI systems that can navigate the interplay between text and imagery with greater precision.

Acknowledgments and Ethical Considerations

The development of Ferret-v2 was supported by a collaborative effort among researchers, with special acknowledgment to those who provided guidance and feedback throughout the project. It is also important to acknowledge the ethical dimensions of advanced LLMs such as Ferret-v2, particularly the need to monitor outputs to mitigate the generation of harmful content. As innovation in AI continues, fostering responsible development and use remains paramount.

Ferret-v2 marks a significant milestone in the evolution of multimodal LLMs, demonstrating the potential of AI to extend the boundaries of multimodal understanding and interaction. As AI capabilities grow increasingly sophisticated, models like Ferret-v2 stand as a testament to continued progress in the field.

Authors (11)
  1. Haotian Zhang (107 papers)
  2. Haoxuan You (33 papers)
  3. Philipp Dufter (21 papers)
  4. Bowen Zhang (161 papers)
  5. Chen Chen (752 papers)
  6. Hong-You Chen (21 papers)
  7. Tsu-Jui Fu (35 papers)
  8. William Yang Wang (254 papers)
  9. Shih-Fu Chang (131 papers)
  10. Zhe Gan (135 papers)
  11. Yinfei Yang (73 papers)