Florence-2: Unified Prompt-Based Vision Model
- Florence-2 is a prompt-based vision foundation model that converts various vision tasks into a unified sequence-to-sequence format using natural language prompts.
- It employs a Dual-Attention Vision Transformer (DaViT) coupled with an encoder-decoder architecture to merge generative and multi-granular visual features efficiently.
- Trained on billions of annotated samples, Florence-2 supports both zero-shot and fine-tuned applications, achieving state-of-the-art performance across diverse benchmarks.
Florence-2 is a prompt-based vision foundation model that provides a unified sequence-to-sequence representation for a broad spectrum of computer vision and vision-language tasks. It underpins a range of recent multimodal LLMs (MLLMs) by integrating both generative feature extraction and multi-granular visual representations, thereby establishing a state-of-the-art platform for both zero-shot and fine-tuned applications across diverse domains.
1. Model Architecture and Core Principles
Florence-2 is architected as a generative vision-language transformer, employing a Dual-Attention Vision Transformer (DaViT) as its visual backbone. Given an input image $I \in \mathbb{R}^{H \times W \times 3}$, the DaViT encoder tokenizes the image into visual tokens $V \in \mathbb{R}^{N_v \times D_v}$, using a combination of spatial self-attention and cross-scale attention mechanisms across hierarchical stages. Each patch $x_i$ is linearly projected and layer-normalized:

$$v_i = \mathrm{LN}(W_p\,x_i + b_p)$$
A prompt-based encoder–decoder transformer then ingests the concatenation of the visual tokens $V$ and a short text prompt $T$, processing this combined sequence via self- and cross-attention. The decoder autoregressively generates task-specific outputs, ranging from image captioning to bounding-box prediction and OCR.
Training is end-to-end with a cross-entropy loss over the target sequence $y = (y_1, \dots, y_L)$:

$$\mathcal{L} = -\sum_{t=1}^{L} \log p_\theta(y_t \mid y_{<t}, V, T)$$
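As a concrete reading of this objective, the following PyTorch sketch computes the teacher-forced cross-entropy for a generic image-conditioned encoder–decoder; the forward-pass signature is illustrative, not the released Florence-2 interface.

```python
import torch.nn.functional as F

def seq2seq_loss(model, pixel_values, prompt_ids, target_ids, pad_id=0):
    """Teacher-forced cross-entropy over the target token sequence.

    `model` is assumed to expose an encoder-decoder forward pass taking
    image pixels plus prompt tokens and returning per-step vocabulary
    logits; the argument names are hypothetical, not Florence-2's API.
    """
    decoder_input = target_ids[:, :-1]   # decoder sees tokens < t
    labels = target_ids[:, 1:]           # and predicts token t

    logits = model(pixel_values=pixel_values,
                   prompt_ids=prompt_ids,
                   decoder_input_ids=decoder_input)  # (B, L-1, vocab)

    # Flatten batch and time, ignoring padded positions in the loss.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1),
                           ignore_index=pad_id)
```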
Florence-2’s architecture supports both low-level (texture, edges) and high-level (semantics, spatial relations, text regions) feature extraction via prompt diversification and hierarchical vision modeling (Xiao et al., 2023, Chen et al., 5 Dec 2024).
2. Unified Prompt-Based Multi-Task Learning
Florence-2 translates vision problems into a sequence-to-sequence format where every task is prompted via natural language and produces output as a sequence of tokens. The tokenizer is augmented with quantized spatial location tokens (e.g., $\langle \mathrm{loc}_x \rangle$, $\langle \mathrm{loc}_y \rangle$, one token per coordinate bin):
- Captioning: “Describe the image.” → free-form text.
- Object Detection: “Locate objects.” → textual class labels + bounding-box location tokens.
- Phrase Grounding, Referring Segmentation, OCR: all formulated as prompted translation tasks with outputs including coordinates, polygons, or structured text.
All vision tasks are unified under autoregressive decoding:

$$p(y \mid x) = \prod_{t=1}^{L} p(y_t \mid y_{<t}, x)$$

where $x = [\text{image embeddings};\ \text{prompt tokens}]$ and $y$ is the output sequence.
Prompt versatility enables task switching with only a change of natural-language input, removing the need for dedicated task heads (Xiao et al., 2023).
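In practice, task switching reduces to changing the prompt string, as in this sketch against the Hugging Face transformers interface published with the model checkpoints (the image URL is a placeholder):

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open(requests.get(
    "https://example.com/street.jpg", stream=True).raw)  # placeholder URL

for task in ["<CAPTION>", "<OD>", "<OCR>"]:  # switch tasks via prompt alone
    inputs = processor(text=task, images=image, return_tensors="pt")
    ids = model.generate(input_ids=inputs["input_ids"],
                         pixel_values=inputs["pixel_values"],
                         max_new_tokens=1024, num_beams=3)
    raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
    # Maps location tokens back to boxes/labels for spatial tasks.
    print(processor.post_process_generation(
        raw, task=task, image_size=(image.width, image.height)))
```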
3. Training Data and Pretraining Regime
Florence-2 is pre-trained on FLD-5B, a large-scale, high-quality dataset comprising 5.4 billion annotations built from 126 million images:
- 500M text annotations (brief/detailed captions)
- 1.3B region–text pairs (region-level phrases and captions)
- 3.6B text–phrase–region triplets (dense region and pixel-level relationships)
Annotation is achieved with an iterative pipeline involving specialist models (e.g., DINO for detection, SAM for masks, LLMs for captions) coupled with refinement, filtering, and merging. Spatial hierarchy is directly encoded through tokenized locations, allowing the model to handle tasks at all granularities (Xiao et al., 2023).
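A minimal sketch of how pixel coordinates become vocabulary items; the 1000-bin granularity and `<loc_i>` token spelling follow the public implementation and are assumptions of this sketch:

```python
def box_to_location_tokens(box, image_size, num_bins=1000):
    """Quantize an (x0, y0, x1, y1) pixel box into discrete location
    tokens appended to the text tokenizer's vocabulary."""
    w, h = image_size
    tokens = []
    for coord, extent in zip(box, (w, h, w, h)):
        # Map the pixel coordinate to a bin index in [0, num_bins - 1].
        i = min(int(coord / extent * num_bins), num_bins - 1)
        tokens.append(f"<loc_{i}>")
    return tokens

# A box at (120, 60, 480, 300) in a 640x480 image becomes:
# ['<loc_187>', '<loc_125>', '<loc_750>', '<loc_625>']
```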
Training employs scalable infrastructure (Deepspeed, mixed precision) and large batch sizes (2k–3k), with curriculum moving from image-level to region- and pixel-level tasks, and leveraging cross-modal contrastive and masked image modeling objectives for both semantic and spatial alignment (Chen et al., 5 Dec 2024).
4. Generative Feature Fusion and Vision-Language Alignment
A notable innovation in Florence-2, especially as deployed in Florence-VL, is “depth-breadth fusion” (DBFusion):
Let $V_0$ be the raw DaViT output and $V_k$ be the prompted feature for the $k$-th task (e.g., caption, OCR, grounding). DBFusion fuses all such features via channel concatenation:

$$V_{\text{fused}} = [\,V_0;\,V_1;\,\dots;\,V_K\,]$$
A lightweight MLP then projects the fused $V_{\text{fused}}$ into the token embedding space of the downstream LLM:

$$E = \mathrm{MLP}(V_{\text{fused}})$$
This joint embedding preserves multi-depth (from raw visual to high-level semantic) and multi-breadth (via prompt diversity) signal, enhancing alignment with language tokens in cross-modal downstream models. Ablation confirms that removal of either depth or breadth features degrades alignment and task performance (Chen et al., 5 Dec 2024).
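A compact PyTorch sketch of this fusion, with illustrative dimensions rather than the actual Florence-VL configuration:

```python
import torch
import torch.nn as nn

class DBFusion(nn.Module):
    """Concatenate the raw DaViT feature with several prompt-conditioned
    features along the channel axis, then project into the LLM's
    token-embedding space with a small MLP."""

    def __init__(self, vis_dim=1024, num_prompts=3, llm_dim=4096):
        super().__init__()
        fused_dim = vis_dim * (1 + num_prompts)  # V_0 plus K prompted views
        self.proj = nn.Sequential(
            nn.Linear(fused_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, v0, prompted):
        # v0: (B, N, C); prompted: list of K tensors, each (B, N, C)
        fused = torch.cat([v0, *prompted], dim=-1)  # channel concatenation
        return self.proj(fused)                     # (B, N, llm_dim)
```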
5. Downstream Fine-Tuning and Adaptation
Florence-2 is expressly designed for both full-parameter and parameter-efficient adaptation:
- Full-Parameter Fine-Tuning: The base and large variants (0.23B and 0.77B params) can be fine-tuned end-to-end even on a single high-end GPU (Khan et al., 6 Nov 2024).
- Low-Rank Adaptation (LoRA): Only a small set (1–2M) of adapter parameters is updated, with the base model weights frozen (Safwan et al., 6 Nov 2025, Ucar et al., 6 Mar 2025).
- Quantization for Edge Deployment: INT4 quantization, combined with pruning, yields substantially faster inference on edge devices with negligible mAP/accuracy loss (Chavan et al., 10 Mar 2025).
Fine-tuning recipes involve AdamW/SGD, cosine learning rate schedules, data augmentations, and cross-entropy or multitask losses, depending on application. Domain-adaptive fine-tuning (e.g., for engineering drawing parsing, medical VQA, or real-time assistive navigation) yields significant boosts over both zero-shot baselines and closed-source competitors not amenable to efficient task-specific re-training (Khan et al., 6 Nov 2024, Khan et al., 20 Jun 2025, Safwan et al., 6 Nov 2025, Chavan et al., 10 Mar 2025).
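A representative LoRA recipe using the peft library is sketched below; the adapter rank and target-module names are assumptions chosen to land in the 1–2M trainable-parameter range reported above.

```python
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True)

# Adapter placement varies across the cited reports; attention projections
# are a common default, so these target-module names are an assumption.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # on the order of 1-2M params

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=5000)
# Training loop: forward with labels, loss.backward(), optimizer.step(),
# scheduler.step() -- only the LoRA adapters receive gradient updates.
```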
6. Empirical Performance and Benchmark Results
Florence-2 delivers state-of-the-art or near state-of-the-art results across an extensive suite of vision and multimodal tasks, both in zero-shot and fine-tuned settings (Xiao et al., 2023, Chen et al., 5 Dec 2024):
- COCO Captioning (CIDEr): 135.6 (zero-shot), 143.3 (multi-task fine-tune)
- VQA (TextVQA): 81.7
- Object Detection (COCO mAP): 43.4
- Phrase Grounding (Flickr30k R@1): 84.4
- RefCOCO/+/g (Acc@0.5): 56.3–68.0 (zero-shot), up to 93.4 (fine-tuned)
- Segmentation (RefCOCO-RES mIoU): 80.5
Florence-2 consistently outperforms CLIP-style contrastive encoders on fine-grained, region/text-centric benchmarks due to its richer promptable generative representations.
In specialized domains:
- Engineering Drawing Parsing: 61.5% F1, 23.3% hallucination (GD&T extraction, +52.4pp F1 over best closed-source baseline) (Khan et al., 6 Nov 2024).
- Panoramic LiDAR: Richer captioning and object detection, improved robustness over CLIP (Cohen et al., 5 Feb 2025).
- Medical VQA: Multi-task LoRA-fine-tuned models deliver +5.2% answer accuracy and large improvements in visual grounding metrics (Safwan et al., 6 Nov 2025).
- Unstructured Object Detection: Fine-tuned Florence-2 (LoRA) achieves mAP = 0.80 (YOLOv10: 0.74), remaining competitive with CNN detectors while retaining multimodal capabilities (Ucar et al., 6 Mar 2025).
- Edge Applications: 4-bit quantized models hold the mAP drop to roughly 1 point while delivering about a 9× speedup on sub-10W hardware (Chavan et al., 10 Mar 2025).
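The edge setting above can be approximated with the bitsandbytes 4-bit integration in transformers; the cited INT4-plus-pruning pipeline is hardware-specific, so this is only a generic stand-in:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Loads NF4-quantized weights with fp16 compute -- an approximation of
# the cited INT4 deployment, not a reproduction of it.
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.float16)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base",
    quantization_config=bnb,
    trust_remote_code=True,
    device_map="auto")
```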
7. Limitations and Future Directions
While Florence-2 achieves significant advances, several challenges remain:
- Structured Output Hallucination: Performance in parsing free-form, highly variable content (e.g., title blocks, tabular data) is limited by hallucination and variability in output structure (Khan et al., 20 Jun 2025).
- Class Imbalance and Layout Generalization: Errors are concentrated in rare classes and irregular layouts, motivating research into schema-constrained decoding and layout-aware positional embeddings.
- Scalability: While the base and large models are amenable to resource-constrained fine-tuning, scaling up representations or adapting to rare domains may benefit from retrieval-augmented or semi-supervised strategies.
- Hybrid Integration: For critical industrial deployments, hybrid frameworks combining Florence-2 outputs with lightweight rule-based or schema validation layers are advocated to ensure output reliability (Khan et al., 20 Jun 2025).
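As an illustration of such a validation layer, this hypothetical check gates a parsed detection result on label membership and box sanity before it is accepted downstream:

```python
def validate_detection(parsed, image_size, allowed_labels):
    """Rule-based sanity check over a parsed detection result.

    `parsed` is assumed to mirror the {'bboxes': [...], 'labels': [...]}
    structure produced by the model's post-processing; the check itself
    is a hypothetical stand-in for a schema-validation layer.
    """
    w, h = image_size
    issues = []
    for box, label in zip(parsed.get("bboxes", []), parsed.get("labels", [])):
        if label not in allowed_labels:
            issues.append(f"unknown label: {label!r}")
        x0, y0, x1, y1 = box
        if not (0 <= x0 < x1 <= w and 0 <= y0 < y1 <= h):
            issues.append(f"out-of-bounds box for {label!r}: {box}")
    return issues  # empty list => output passed all checks
```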
Ongoing work includes further refinement of the pretraining corpus, development of parameter-efficient finetuning methods, integration with external retrieval libraries, and schema-constrained generation to reduce hallucination and improve robustness in complex manufacturing and document-centric scenarios.
Florence-2 represents a comprehensive and flexible vision foundation model that advances unified prompt-based vision–language reasoning, establishes new state-of-the-art benchmarks in diverse domains, and provides a practical pathway for specialized task adaptation and edge deployment (Xiao et al., 2023, Chen et al., 5 Dec 2024, Khan et al., 20 Jun 2025, Khan et al., 6 Nov 2024, Safwan et al., 6 Nov 2025, Ucar et al., 6 Mar 2025, Cohen et al., 5 Feb 2025, Chavan et al., 10 Mar 2025).