Florence: A New Foundation Model for Computer Vision
This paper presents Florence, a computer vision foundation model developed by a team at Microsoft. The core objective of Florence is to serve as a versatile, general-purpose vision system that adapts to a wide range of downstream computer vision tasks with minimal task-specific customization. This goal aligns with the broader effort to build AI systems that generalize well across disparate tasks without extensive human intervention.
Florence is designed to extend visual representations along multiple dimensions: from coarse (scene-level) to fine-grained (object-level) classification, from static images to dynamic video, and from RGB inputs to multimodal inputs such as captions and depth. Unlike existing models such as CLIP and ALIGN, which focus primarily on mapping images and text into a shared cross-modal representation, Florence broadens and deepens representations across these extended dimensions.
Unified Learning from Large-scale Datasets
To construct Florence, the researchers built a data curation pipeline that aggregates 900 million image-text pairs from publicly available web sources. The noise inherent in web-crawled data is mitigated through rigorous filtering and a unified learning objective: a unified image-text contrastive learning (UniCL) method that improves learning efficiency on web data. Training proceeds in two phases: an initial phase uses all data, including augmented texts, while a second phase excludes augmented texts to optimize learning from natural web descriptions.
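The key idea behind a unified contrastive objective can be sketched as follows: instead of treating only the diagonal image-text pairs of a batch as positives (as in standard CLIP-style training), all pairs that share the same text or label hash count as positives. The NumPy sketch below is an illustrative reconstruction under that assumption, not the paper's implementation; the `labels` argument stands in for the text/label hashes.

```python
import numpy as np

def unicl_loss(img_emb, txt_emb, labels, temperature=0.07):
    """Sketch of a UniCL-style bidirectional contrastive loss.

    Image-text pairs sharing a label hash are all treated as positives,
    rather than only the batch diagonal as in plain CLIP-style training.
    """
    # L2-normalize both embedding sets so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature                       # (B, B) similarities

    # pos[i, j] = 1 when image i and text j share a label; rows normalized
    # so each row is a soft target distribution over the batch
    pos = (labels[:, None] == labels[None, :]).astype(float)
    targets = pos / pos.sum(axis=1, keepdims=True)

    def soft_ce(lg, tg):
        # cross-entropy against soft targets, with a stable log-softmax
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return float(-(tg * logp).sum(axis=1).mean())

    # symmetric loss: image-to-text plus text-to-image directions
    return 0.5 * (soft_ce(logits, targets) + soft_ce(logits.T, targets))

# Toy usage: a batch of 4 with two images sharing label 0
rng = np.random.default_rng(0)
loss = unicl_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)),
                  np.array([0, 1, 0, 2]))
```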
Advanced Transformer Architectures
Florence adopts a two-tower architecture: a 12-layer transformer for text encoding and a hierarchical Vision Transformer (CoSwin) with dynamic heads for image encoding. This combination facilitates efficient feature extraction while keeping the computation feasible for dense prediction tasks.
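The two-tower layout can be caricatured as two independent encoders projecting different modalities into one shared embedding space, where a similarity matrix is then computed. The linear maps below are hypothetical stand-ins for the real transformers (CoSwin on the image side, the 12-layer transformer on the text side), and all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

class Tower:
    """Stand-in for a transformer tower: a random linear projection
    into the shared embedding space (toy dimensions, not the paper's)."""
    def __init__(self, in_dim, emb_dim):
        self.W = rng.normal(size=(in_dim, emb_dim)) / np.sqrt(in_dim)

    def __call__(self, x):
        z = x @ self.W
        # unit-normalize so dot products are cosine similarities
        return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Two towers mapping different input dimensionalities into one 256-d space
image_tower = Tower(in_dim=1024, emb_dim=256)  # stands in for CoSwin
text_tower  = Tower(in_dim=512,  emb_dim=256)  # stands in for the text transformer

images = rng.normal(size=(4, 1024))
texts  = rng.normal(size=(4, 512))
sim = image_tower(images) @ text_tower(texts).T  # (4, 4) image-text similarities
```

Because each tower only sees its own modality, the towers can be scaled, swapped, or extended (e.g. with adapters for detection or video) independently, which is the property the paper exploits for its downstream adaptations.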
For large-scale training, the model relies on an optimized infrastructure that combines the Zero Redundancy Optimizer (ZeRO), activation checkpointing, mixed-precision training, and gradient caching. These techniques substantially reduce memory consumption and increase training throughput.
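Several of these techniques are commonly switched on through a DeepSpeed-style configuration file (ZeRO originates from the DeepSpeed library). The fragment below is purely illustrative: the keys follow DeepSpeed's config schema, but the values are placeholders, the paper does not publish its exact configuration, and gradient caching is implemented in the training loop rather than via a config key.

```json
{
  "train_batch_size": 512,
  "fp16": { "enabled": true },
  "zero_optimization": { "stage": 1 },
  "activation_checkpointing": { "partition_activations": true }
}
```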
Outstanding Performance Metrics
Florence demonstrates remarkable performance across multiple challenging benchmarks:
- Zero-Shot Classification: Florence achieves state-of-the-art results on 12 datasets, including 83.74% top-1 and 97.18% top-5 accuracy on ImageNet-1K.
- Linear-Probe Classification: The model excels in linear-probe evaluations across 11 datasets, outperforming competing methods.
- Object Detection: Florence sets new performance standards on object detection benchmarks such as COCO, Object365, and Visual Genome with minimal adaptation.
- Vision-Language Tasks: With the METER adapter, Florence reaches 80.36% accuracy on the VQA test-std benchmark, a leading result in Visual Question Answering.
- Text-to-Video Retrieval and Video Action Recognition: With video-specific adaptations, Florence exhibits strong zero-shot text-to-video retrieval and sets new records for video action recognition on the Kinetics-400 and Kinetics-600 datasets.
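The zero-shot classification results above follow the standard recipe for contrastively trained vision-language models: embed each class name as a text prompt, then predict the class whose text embedding is most similar to the image embedding. A minimal sketch, assuming precomputed embeddings (the names and 4-d vectors here are toy values, not Florence outputs):

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    """CLIP/Florence-style zero-shot classification: pick the class whose
    text-prompt embedding has the highest cosine similarity to the image."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    scores = txt @ img                     # cosine similarity per class
    return class_names[int(np.argmax(scores))]

# Toy usage: the image embedding is aligned with the "dog" text embedding
classes = ["cat", "dog", "car"]
text_embs = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.0]])
image = np.array([0.1, 0.9, 0.05, 0.0])
print(zero_shot_classify(image, text_embs, classes))  # → dog
```

No task-specific training happens at inference time, which is why new label sets can be handled simply by embedding new class-name prompts.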
Implications and Future Prospects
The implications of Florence's successful deployment are significant, both practical and theoretical. Practically, its ability to perform well across diverse tasks with minimal fine-tuning paves the way for more efficient real-world AI applications, reducing reliance on extensive labeled data and domain specialization. Theoretically, its hierarchical, adaptable architecture provides a robust framework for future research and development in AI, especially for building integrative models in the spirit of the XYZ-code vision for multimodal AI.
Florence's promising results in zero-shot learning highlight the potential for further reducing the data and resource barriers in developing adaptable AI systems. Future developments might focus on integrating more vision tasks, such as depth estimation and tracking, and enhancing the model's applicability in real-time, real-world applications. This research marks a considerable step towards creating holistic AI systems capable of human-like generalization and adaptability, setting a solid foundation for the advancement of foundation models in computer vision.
Authors
- Lu Yuan
- Dongdong Chen
- Yi-Ling Chen
- Noel Codella
- Xiyang Dai
- Jianfeng Gao
- Houdong Hu
- Xuedong Huang
- Boxin Li
- Chunyuan Li
- Ce Liu
- Mengchen Liu
- Zicheng Liu
- Yumao Lu
- Yu Shi
- Lijuan Wang
- Jianfeng Wang
- Bin Xiao
- Zhen Xiao
- Jianwei Yang