Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion (2412.04424v1)

Published 5 Dec 2024 in cs.CV and cs.AI

Abstract: We present Florence-VL, a new family of multimodal LLMs (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. Unlike the widely used CLIP-style vision transformer trained by contrastive learning, Florence-2 can capture different levels and aspects of visual features, which are more versatile and can be adapted to diverse downstream tasks. We propose a novel feature-fusion architecture and an innovative training recipe that effectively integrates Florence-2's visual features into pretrained LLMs, such as Phi 3.5 and Llama 3. In particular, we propose "depth-breadth fusion (DBFusion)" to fuse the visual features extracted from different depths and under multiple prompts. Our model training is composed of end-to-end pretraining of the whole model followed by finetuning of the projection layer and the LLM, on a carefully designed recipe of diverse open-source datasets that include high-quality image captions and instruction-tuning pairs. Our quantitative analysis and visualization of Florence-VL's visual features show its advantages over popular vision encoders on vision-language alignment, where the enriched depth and breadth play important roles. Florence-VL achieves significant improvements over existing state-of-the-art MLLMs across various multi-modal and vision-centric benchmarks covering general VQA, perception, hallucination, OCR, chart, knowledge-intensive understanding, etc. To facilitate future research, our models and the complete training recipe are open-sourced. https://github.com/JiuhaiChen/Florence-VL

Authors (7)
  1. Jiuhai Chen (26 papers)
  2. Jianwei Yang (93 papers)
  3. Haiping Wu (16 papers)
  4. Dianqi Li (18 papers)
  5. Jianfeng Gao (344 papers)
  6. Tianyi Zhou (172 papers)
  7. Bin Xiao (93 papers)

Summary

Overview of "Florence-VL: Enhancing Vision-LLMs with Generative Vision Encoder and Depth-Breadth Fusion"

The paper introduces Florence-VL, a family of multimodal LLMs (MLLMs) designed with enriched visual representations derived from Florence-2, a generative vision foundation model. Unlike widely used CLIP-style vision transformers, which rely on contrastive learning, Florence-2 employs generative pre-training, enabling it to capture a broader range of visual features that are more readily adapted to diverse downstream tasks.

To integrate these capabilities into LLMs, Florence-VL introduces a novel feature-fusion architecture and a targeted training scheme. A key innovation is the "depth-breadth fusion" (DBFusion) method, which combines visual features extracted at varying depths and under multiple prompts. The full model is first pre-trained end-to-end, after which the projection layer and the LLM are fine-tuned on a carefully curated mix of open-source datasets composed of high-quality image captions and instruction-tuning pairs. The quantitative and qualitative analyses show that Florence-VL substantially surpasses existing state-of-the-art models on numerous benchmarks, demonstrating strong vision-language alignment in both general and task-specific settings.
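The paper describes DBFusion at a high level; the snippet below is a minimal sketch of the idea, assuming hypothetical shapes and names (the 576x1024 feature grids, the branch count, and the `DBFusionSketch` class are illustrative placeholders, not the authors' released implementation). Features extracted under several task prompts (breadth) and from different encoder depths (depth) are concatenated along the channel axis and projected into the LLM's embedding space.

```python
import torch
import torch.nn as nn

class DBFusionSketch(nn.Module):
    """Minimal sketch of depth-breadth fusion (DBFusion).

    Assumes each Florence-2 forward pass yields an [N, C] grid of visual
    features per image; names and dimensions are illustrative only.
    """

    def __init__(self, feat_dim: int, num_branches: int, llm_dim: int):
        super().__init__()
        # Channel integration: concatenate branch features along the channel
        # axis, then project to the LLM embedding width with a small MLP.
        self.proj = nn.Sequential(
            nn.Linear(feat_dim * num_branches, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, branch_feats: list[torch.Tensor]) -> torch.Tensor:
        # branch_feats: one [N, C] tensor per (prompt, depth) branch, e.g.
        # caption / OCR / grounding prompts ("breadth") plus a lower-level
        # feature map ("depth").
        fused = torch.cat(branch_feats, dim=-1)   # [N, C * num_branches]
        return self.proj(fused)                   # [N, llm_dim] visual tokens


# Toy usage with three hypothetical branches of 576 tokens x 1024 channels.
branches = [torch.randn(576, 1024) for _ in range(3)]
fusion = DBFusionSketch(feat_dim=1024, num_branches=3, llm_dim=3072)
visual_tokens = fusion(branches)  # 576 tokens ready to prepend to the LLM input
```

Because the branches are merged channel-wise, the number of visual tokens handed to the LLM stays fixed regardless of how many prompts or depths are fused.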

Technical Contributions

Florence-VL's architecture enhances vision-language models through three main components:

  • Florence-2 Visual Features: A generative vision encoder that captures versatile visual representations adaptable to diverse computer vision tasks, such as object detection and OCR.
  • Depth-Breadth Fusion (DBFusion): Fuses visual features drawn from multiple encoder depths and task prompts, balancing high-level conceptual understanding with the detailed perceptual specificity needed for varied downstream tasks.
  • End-to-End Pre-training with Fine-Tuning: A training recipe that first pre-trains the full model on open-source datasets, then fine-tunes the projection layer and the LLM (see the sketch after this list).
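A rough sketch of that two-stage recipe is shown below, assuming placeholder modules and illustrative learning rates (the `nn.ModuleDict` entries stand in for Florence-2, the DBFusion projector, and the LLM; this is not the released training code).

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the real components; names and sizes
# are illustrative, not the released Florence-VL code.
model = nn.ModuleDict({
    "vision_encoder": nn.Linear(1024, 1024),  # stands in for Florence-2
    "projector": nn.Linear(1024, 3072),       # stands in for the DBFusion projection
    "llm": nn.Linear(3072, 3072),             # stands in for Phi 3.5 / Llama 3
})

# Stage 1: end-to-end pretraining -- every component is trainable, so the
# vision encoder, projector, and LLM are jointly adapted on caption data.
for p in model.parameters():
    p.requires_grad = True
opt_stage1 = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
# ... pretraining loop over image-caption data ...

# Stage 2: instruction tuning -- freeze the vision encoder and update only
# the projector and the LLM on instruction-tuning pairs.
for p in model["vision_encoder"].parameters():
    p.requires_grad = False
opt_stage2 = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)
# ... fine-tuning loop over instruction data ...
```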

Empirical Evaluations

Florence-VL has been empirically validated on 25 benchmarks spanning multiple categories including general VQA, perception tasks, OCR, and knowledge-intensive understanding, showing notable advancements over existing MLLMs like Cambrian. The model’s performance enhancements are attributed to the unique depth and breadth of visual features, setting it apart from other vision encoders such as CLIP and SigLIP in terms of visual-textual alignment.

The channel integration strategy adopted for feature fusion showcases superior performance and training efficiency over token integration and average pooling. The models are evaluated against baselines with varied vision encoder configurations, highlighting Florence-VL's robust alignment and improved benchmark scores.
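The efficiency trade-off among the three strategies can be read off the tensor shapes alone; the snippet below is an illustrative sketch with hypothetical dimensions, not the paper's implementation.

```python
import torch

# Three hypothetical feature branches: 576 tokens x 1024 channels each.
branches = [torch.randn(576, 1024) for _ in range(3)]

# Channel integration: concatenate along channels -> sequence length stays 576,
# so the LLM sees no extra visual tokens.
channel = torch.cat(branches, dim=-1)          # [576, 3072]

# Token integration: concatenate along the token axis -> 3x more visual tokens,
# which grows the LLM context and attention cost.
token = torch.cat(branches, dim=0)             # [1728, 1024]

# Average pooling: average across branches -> cheap, but branch-specific
# detail is blended away.
pooled = torch.stack(branches, dim=0).mean(0)  # [576, 1024]
```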

Implications and Future Directions

Practically, Florence-VL sets a new standard for multimodal interactions, presenting a compelling case for integrating generative vision models within broader AI frameworks. Theoretically, it prompts a reevaluation of how visual and textual representations can be more effectively aligned, broadening the scope for vision-centric MLLMs.

Future avenues may involve refining the DBFusion technique to dynamically adjust feature fusion for specific tasks or developing adaptive vision encoders to improve computational efficiency. By releasing the models and the complete training recipe, the work also supports community-driven advances in AI and multimodal learning.
