
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (2311.06242v1)

Published 10 Nov 2023 in cs.CV

Abstract: We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform a diversity of tasks with simple instructions, a capability that implies handling the complexity of various spatial hierarchy and semantic granularity. Florence-2 was designed to take text-prompt as task instructions and generate desirable results in text forms, whether it be captioning, object detection, grounding or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B that consists of 5.4 billion comprehensive visual annotations on 126 million images, using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrated Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.

Citations (62)

Summary

  • The paper introduces Florence-2, a unified vision model that uses a prompt-based sequence-to-sequence framework to perform multiple vision and vision-language tasks.
  • It co-developed the FLD-5B dataset with 5.4 billion visual annotations, enhancing accuracy in tasks such as captioning, object detection, grounding, and segmentation.
  • Extensive evaluations show Florence-2 excels in zero-shot settings, achieving state-of-the-art metrics and streamlining multi-task integration in vision applications.

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

The paper "Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks" introduces an advanced vision foundation model named Florence-2. This model employs a unified, prompt-based representation suitable for a wide range of computer vision and vision-language tasks. The goal is to address a limitation of existing large vision models, which excel in transfer learning but struggle to perform diverse tasks from simple instructions.

Objectives and Approach

Florence-2 aims to enable extensive perception capabilities, including the understanding of spatial hierarchy and semantic granularity in images. To achieve this, the authors co-developed the FLD-5B dataset, which includes 5.4 billion comprehensive visual annotations across 126 million images. The dataset was generated using an iterative process that combines automated image annotation and model refinement. The Florence-2 model leverages a sequence-to-sequence (seq2seq) structure, allowing it to generate desirable results in text forms for tasks such as captioning, object detection, grounding, and segmentation.
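Expressing region-level outputs as text is what makes this single seq2seq interface possible: the paper adds quantized "location tokens" to the vocabulary so that boxes and polygons can be emitted as ordinary token sequences. A minimal sketch of that quantization (the bin count follows the paper's 1,000-bin scheme; the helper names are illustrative, not the authors' code):

```python
# Sketch: serializing box coordinates as quantized location tokens, so that
# detection and grounding results become plain text the decoder can generate.

NUM_BINS = 1000  # each axis is quantized into 1,000 bins per the paper

def coord_to_token(value: float, size: int) -> str:
    """Map a pixel coordinate to a discrete location token."""
    bin_idx = min(int(value / size * NUM_BINS), NUM_BINS - 1)
    return f"<loc_{bin_idx}>"

def box_to_text(box, width, height):
    """Serialize an (x0, y0, x1, y1) box as a location-token sequence."""
    x0, y0, x1, y1 = box
    return (coord_to_token(x0, width) + coord_to_token(y0, height)
            + coord_to_token(x1, width) + coord_to_token(y1, height))

# A detection result then becomes ordinary text:
print(box_to_text((32, 64, 256, 480), width=640, height=640))
# → <loc_50><loc_100><loc_400><loc_750>
```

With all outputs flattened to text in this way, captioning, detection, grounding, and segmentation can share one decoder and one training objective.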

Model and Data Engine

Florence-2 integrates a vision encoder and a multi-modal encoder-decoder. The architecture follows the standard encoder-decoder transformer design, processing visual and language token embeddings together. Training is driven by a single token-level cross-entropy loss applied uniformly across all tasks.
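Because every task's target is a token sequence, the objective reduces to ordinary next-token cross-entropy regardless of whether the target encodes a caption, boxes, or masks. A self-contained sketch (the logits and vocabulary here are toy values for illustration):

```python
import math

def cross_entropy(logits: list, targets: list) -> float:
    """Mean negative log-likelihood of target tokens under softmax(logits)."""
    total = 0.0
    for step_logits, target in zip(logits, targets):
        z = max(step_logits)                        # stabilize the softmax
        log_norm = z + math.log(sum(math.exp(l - z) for l in step_logits))
        total += log_norm - step_logits[target]     # -log p(target | prefix)
    return total / len(targets)

# One loss for every task: only the target token sequence changes.
logits = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]  # 2 decoding steps, vocab of 3
loss = cross_entropy(logits, targets=[0, 1])
```

Sharing one objective is what lets the model be "consistently optimized across various tasks" without per-task heads or losses.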

The data engine employs a multi-phase approach:

  1. Initial Annotation with Specialist Models: Different models are employed to annotate images with high accuracy.
  2. Data Filtering and Enhancement: Annotations are refined to reduce noise and inaccuracies.
  3. Iterative Data Refinement: The process iteratively enhances the quality of the dataset by leveraging model predictions for further training.
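The three phases above can be sketched as a simple loop. The function names (`specialists`, `model.train`, `model.predict`) are placeholders for this illustration, not the authors' implementation:

```python
def build_dataset(images, specialists, model, rounds=3):
    """Iteratively annotate, filter, and refine labels with the learner itself."""
    dataset = []
    for image in images:
        # Phase 1: initial annotations from task-specific specialist models
        annotations = [s(image) for s in specialists]
        # Phase 2: drop noisy or failed annotations
        annotations = [a for a in annotations if a is not None]
        dataset.append((image, annotations))

    for _ in range(rounds):
        model.train(dataset)
        # Phase 3: let the improving model re-annotate, refining the labels
        dataset = [(img, model.predict(img) or anns) for img, anns in dataset]
    return dataset
```

Each round the model is trained on the current labels and then used to produce cleaner ones, which is what drives the annotation quality upward over iterations.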

This robust dataset and training pipeline ensure that Florence-2 provides a high-quality universal representation.

Extensive Evaluations

Florence-2 demonstrates state-of-the-art performance in zero-shot settings across multiple vision tasks:

  • Captioning: Achieving 135.6 CIDEr on the COCO Caption benchmark.
  • Object Detection: Significant improvements are shown in downstream tasks such as COCO object detection.
  • Visual Grounding and Referring Expressions: The model outperforms existing models by a substantial margin on the Flickr30k and RefCOCO/+/g datasets.
  • Segmentation: Achieving high mIoU scores on ADE20K.

Implications and Future Directions

Florence-2's ability to handle various vision tasks with a single model has several implications:

  1. Unified Framework: Simplifies the integration of multiple vision tasks, reducing the need for task-specific models and fine-tuning.
  2. Enhanced Generalization: The model's capacity to understand detailed semantic and spatial aspects of images enables better performance in downstream applications.

Looking forward, the research can be expanded in multiple directions:

  • Scalability: Exploring the effects of even larger datasets and model sizes.
  • Real-time Applications: Optimizing Florence-2 for deployment in real-time scenarios.
  • Cross-Domain Transfer Learning: Investigating the model's adaptability to entirely new domains without substantial re-training.

Conclusion

Florence-2 stands as a robust vision foundation model, providing a unified multi-task learning approach with unprecedented zero-shot capabilities. Its success demonstrates the potential of comprehensive annotated datasets and multi-modal seq2seq architectures in advancing the field of computer vision. Given its strong numerical results and the versatility of applications, Florence-2 marks a noteworthy step in the development of universal vision models.
