
Florence: A New Foundation Model for Computer Vision

Published 22 Nov 2021 in cs.CV, cs.AI, and cs.LG | (2111.11432v1)

Abstract: Automated visual understanding of our diverse and open world demands computer vision models to generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale dataset and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and textual representations to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted for various computer vision tasks, such as classification, retrieval, object detection, VQA, image caption, video retrieval and action recognition. Moreover, Florence demonstrates outstanding performance in many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer and zero-shot transfer for novel images and objects. All of these properties are critical for our vision foundation model to serve general purpose vision tasks. Florence achieves new state-of-the-art results in majority of 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with top-1 accuracy of 83.74 and the top-5 accuracy of 97.18, 62.4 mAP on COCO fine tuning, 80.36 on VQA, and 87.8 on Kinetics-600.

Citations (800)

Summary

  • The paper presents Florence, a vision foundation model that extends representations from scenes to objects, from static images to videos, and from RGB to additional modalities such as captions and depth.
  • Its design combines unified image-text contrastive learning (UniCL) with CoSwin, a hierarchical Vision Transformer that adds convolutional embeddings to a Swin-style backbone.
  • Florence reports 83.74% top-1 zero-shot accuracy on ImageNet-1K and state-of-the-art results on the majority of 44 benchmarks, including object detection on COCO.

Florence: A New Foundation Model for Computer Vision

Florence is a foundation model for computer vision that extends existing image-text representations across granularity, time, and modality, and reports state-of-the-art results on the majority of 44 representative benchmarks.

Space-Time-Modality Spectrum

Florence is framed around a space-time-modality spectrum of computer vision tasks. Along the space axis, representations range from coarse scene-level classification to fine-grained object detection. Along the time axis, they range from static images to dynamic videos. Along the modality axis, they extend beyond RGB to captions and depth (Figure 1).

Figure 1: Common computer vision tasks are mapped to a Space-Time-Modality space. A computer vision foundation model should serve as a general-purpose vision system for all of these tasks.

Building Florence

Building Florence involves four steps: data curation, a unified learning objective, Transformer architectures, and adaptation to downstream tasks. The adapted models can then be integrated into larger vision systems for applications ranging from object detection to action recognition (Figure 2).

Figure 2: Overview of building Florence. Our workflow consists of data curation, unified learning, Transformer architectures and adaption. It shows the foundation model can be adapted to various downstream tasks and finally integrated into modern computer vision system to power real-world vision and multimedia applications.

Unified Image-Text Contrastive Learning

Florence uses a two-tower architecture with separate encoders for images and text, trained so that both map into a shared embedding space. Training follows the UniCL (unified image-text contrastive learning) objective: every image-text pair in a batch whose text maps to the same label is treated as a positive, and all other pairs are treated as negatives. This accommodates noisy web data in which many images share the same or equivalent descriptions.
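The objective described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `unicl_loss`, the fixed temperature of 0.07, and the averaging over the batch are assumptions made for the example, and toy vectors stand in for encoder outputs.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def unicl_loss(image_emb, text_emb, labels, temperature=0.07):
    """Bidirectional contrastive loss where pairs sharing a label are positives.

    image_emb, text_emb: lists of equal-length vectors, one pair per sample.
    labels: one label per pair; equal labels mark additional positives.
    temperature: fixed here for simplicity (an assumption of this sketch).
    """
    u = [l2_normalize(v) for v in image_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(labels)
    # sim[i][j] = cosine(image i, text j) / temperature
    sim = [[sum(a * b for a, b in zip(u[i], t[j])) / temperature
            for j in range(n)] for i in range(n)]

    def one_direction(logits):
        total = 0.0
        for i in range(n):
            log_denom = math.log(sum(math.exp(s) for s in logits[i]))
            positives = [j for j in range(n) if labels[j] == labels[i]]
            # Average negative log-likelihood over all positives of sample i.
            total -= sum(logits[i][j] - log_denom for j in positives) / len(positives)
        return total / n

    sim_transposed = [list(col) for col in zip(*sim)]
    # Image-to-text term plus text-to-image term.
    return one_direction(sim) + one_direction(sim_transposed)

# Toy batch: two pairs with distinct labels and well-aligned embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
print(unicl_loss(images, texts, labels=[0, 1]))
```

When every label in the batch is unique, the only positive for each sample is its own pair, and the loss reduces to the standard symmetric image-text contrastive loss.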

Transformer Architectures

Florence's image encoder is CoSwin, a hierarchical Vision Transformer. Hierarchical designs model the scale-invariant nature of images and keep computational cost manageable as image size grows. CoSwin builds on a Swin-style backbone and adds convolutional embedding layers, combining the locality of convolutions with attention-based learning. This matters for tasks that need dense spatial understanding, such as object detection.
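To make the hierarchy concrete, the sketch below computes how strided convolutional embeddings shrink the token grid stage by stage, and compares the number of attention pairs under global and windowed attention. The kernel, stride, and padding values in `STAGES` and the window size of 7 are hypothetical choices for illustration; the summary does not state CoSwin's actual settings.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# Hypothetical (kernel, stride, padding) per stage; not taken from the paper.
STAGES = [(7, 4, 3), (3, 2, 1), (3, 2, 1), (3, 2, 1)]

def token_grid_per_stage(image_size, stages=STAGES):
    """Side length of the token grid after each convolutional embedding."""
    sizes = []
    size = image_size
    for kernel, stride, padding in stages:
        size = conv_output_size(size, kernel, stride, padding)
        sizes.append(size)
    return sizes

def attention_pairs(grid, window=None):
    """Count query-key pairs for a square token grid.

    window=None means global attention (quadratic in token count).
    Otherwise each token attends only within a window x window block,
    so the count grows linearly with the number of tokens.
    """
    tokens = grid * grid
    if window is None:
        return tokens * tokens
    return tokens * window * window

grids = token_grid_per_stage(224)
print(grids)
print(attention_pairs(grids[0]), attention_pairs(grids[0], window=7))
```

The first stage carries the most tokens, which is where restricting attention to local windows saves the most computation.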

Performance and Adaptation

Florence is evaluated under several transfer settings: full fine-tuning, linear probing, few-shot transfer, and zero-shot transfer. The benchmarks cover both image and video recognition tasks.

  • Zero-shot classification: Florence achieves 83.74% top-1 and 97.18% top-5 accuracy on ImageNet-1K, which the abstract reports as a new state of the art.
  • Object detection: Using a Dynamic Head adapter, Florence reaches 62.4 mAP on COCO after fine-tuning and is also fine-tuned on Object365 and Visual Genome. It additionally shows zero-shot transfer to 11 downstream detection tasks, which indicates that the representations generalize to novel object categories (Figure 3).

    Figure 3: Our fine-tuned detection results on COCO (sparse object boxes), Object365 (dense object boxes), Visual Genome (w/ object attributes), and zero-shot transfer results on 11 downstream detection tasks. Boxes with different colors denote different object categories.
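Zero-shot classification with a two-tower model generally works by embedding one text prompt per class, embedding the image, and picking the class whose text embedding is most similar. The following is a minimal sketch under that assumption; the function `zero_shot_classify`, the prompt wording, and the three-dimensional toy vectors are illustrative stand-ins for real encoder outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return the best-matching class prompt and all similarity scores."""
    scores = {name: cosine(image_embedding, emb)
              for name, emb in class_text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy vectors standing in for text-encoder outputs, one per class prompt.
class_text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
# Toy vector standing in for the image-encoder output.
image_embedding = [0.8, 0.2, 0.1]

prediction, scores = zero_shot_classify(image_embedding, class_text_embeddings)
print(prediction)
```

No classifier weights are trained for the target classes; adding a class only requires embedding one more prompt.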

Scalable Training Infrastructure

To train at this scale, Florence combines several memory-saving techniques: ZeRO, activation checkpointing, and gradient caching. Together these reduce GPU memory consumption, which permits larger batch sizes. Large batches matter for contrastive learning because each batch supplies the negatives that every pair is contrasted against (Figure 4).

Figure 4: GPU memory reduction for various batch sizes. We compared the profiling between Torch (w/o optimization) and Florence (w/ optimization) on various numbers of GPUs.
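A rough calculation shows why partitioning model states across GPUs frees memory. The sketch below uses the standard accounting for mixed-precision Adam training (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states) and the three ZeRO partitioning stages. The parameter count, GPU count, and the choice of stage are assumptions for the example; the summary does not say which ZeRO stage Florence uses, and activation memory is not counted here.

```python
def model_state_bytes_per_gpu(num_params, num_gpus, zero_stage):
    """Per-GPU memory for model states under mixed-precision Adam.

    zero_stage 0: nothing partitioned (plain data parallelism).
    zero_stage 1: optimizer states partitioned across GPUs.
    zero_stage 2: gradients partitioned as well.
    zero_stage 3: weights partitioned as well.
    """
    weights = 2 * num_params      # fp16 parameters
    grads = 2 * num_params        # fp16 gradients
    optimizer = 12 * num_params   # fp32 master weights, momentum, variance
    if zero_stage >= 1:
        optimizer /= num_gpus
    if zero_stage >= 2:
        grads /= num_gpus
    if zero_stage >= 3:
        weights /= num_gpus
    return weights + grads + optimizer

# Hypothetical setup: 1 billion parameters on 8 GPUs.
for stage in range(4):
    gib = model_state_bytes_per_gpu(1_000_000_000, 8, stage) / 2**30
    print(f"ZeRO stage {stage}: {gib:.2f} GiB per GPU")
```

Memory freed from model states can be spent on larger per-GPU batches, and activation checkpointing reduces the remaining activation memory by recomputing intermediate results during the backward pass.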

Conclusion

Florence is a step toward a general-purpose foundation model for computer vision: a single pretrained model that adapts to classification, retrieval, detection, VQA, captioning, and video tasks with limited task-specific customization. Its combination of curated web-scale data, a unified contrastive objective, and a hierarchical Transformer offers a template for later vision foundation models. Future directions include supporting more modalities and improving zero-shot transfer.
