LoRA-tuned Florence-2 Model

Updated 11 November 2025
  • The paper introduces a dual-encoder transformer architecture integrated with LoRA to achieve parameter-efficient fine-tuning and near state-of-the-art detection accuracy.
  • It demonstrates that optimizing only 0.71% of the total parameters reduces memory and compute requirements while retaining robust multi-modal capabilities.
  • Empirical evaluations on RGB datasets show that the LoRA-tuned Florence-2 model rivals YOLO variants, offering stable training with applications like VQA and segmentation.

The LoRA-tuned Florence-2 model is a vision-language transformer system adapted for object detection in unstructured, cluttered environments through the integration of Low-Rank Adaptation (LoRA). Florence-2 leverages a dual-encoder transformer design, combining large-scale visual and textual information processing. The application of LoRA enables parameter-efficient fine-tuning, facilitating tractable training on resource-constrained hardware while achieving near state-of-the-art detection accuracy.

1. Florence-2 Base Architecture and LoRA Integration

Florence-2-base is composed of approximately 270 million parameters and features a dual-encoder, transformer-based architecture tailored for multi-modal vision-language tasks. The visual encoder employs the DaViT (Dual-attention Vision Transformer), producing hierarchical patch embeddings across four stages. Textual information is processed through a BERT-style encoder, accommodating both text tokens and specialized tokens for object detection and bounding box coordinates (e.g., <OD>, x0, y0, x1, y1).
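
For orientation, the publicly released Hugging Face checkpoint of Florence-2 exposes this object-detection interface through the <OD> task prompt. The sketch below follows that documented processor/generate usage and is not code from the paper; the image path is a placeholder.

```python
# Hedged sketch: object detection with the public Florence-2-base checkpoint
# via its <OD> task prompt. The image path is a placeholder.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True
)

image = Image.open("example.jpg")  # placeholder RGB image
inputs = processor(text="<OD>", images=image, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )

text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# Converts the generated token sequence (task and location tokens) into boxes/labels.
detections = processor.post_process_generation(
    text, task="<OD>", image_size=(image.width, image.height)
)
print(detections)  # {'<OD>': {'bboxes': [...], 'labels': [...]}}
```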

The backbone consists of stacked transformer encoder-decoder layers that fuse visual and textual streams to produce final bounding box and class predictions. LoRA adaptation modules are injected into each block's core linear projection layers: the query ($W_q$), key ($W_k$), and value ($W_v$) projections and the initial feed-forward layer ($W_1$).

During LoRA fine-tuning, all original Florence-2 weights remain frozen. Only the parameters of the LoRA adapters (matrices $B$ and $A$) within the specified layers are optimized. This enables highly parameter-efficient adaptation and reduces memory and compute requirements during training.
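
A hedged sketch of this adapter injection using Hugging Face PEFT, continuing from the model loaded above. The target module names (q_proj, k_proj, v_proj, fc1) are assumptions standing in for the paper's $W_q$, $W_k$, $W_v$, and $W_1$ layers and would need to match the checkpoint's actual layer names.

```python
from peft import LoraConfig, get_peft_model

# Module names are assumptions standing in for W_q, W_k, W_v, and W_1;
# adjust them to the layer names actually used by the checkpoint.
lora_config = LoraConfig(
    r=8,                # LoRA rank from the best-performing configuration
    lora_alpha=16,      # scaling factor alpha
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "fc1"],
    bias="none",
)

model = get_peft_model(model, lora_config)  # freezes all base weights automatically
model.print_trainable_parameters()          # expect roughly 0.7% trainable
```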

2. LoRA Mathematical Formulation

In conventional transformer fine-tuning, the full weight matrix $W \in \mathbb{R}^{d \times d}$ of each linear transformation is updated. LoRA replaces full-rank optimization with a low-rank correction term:

$$\Delta W = BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times d},\; r \ll d$$

The updated weight at training time is:

$$W' = W + \alpha\, \Delta W = W + \alpha\, BA$$

where $r$ is the LoRA rank hyperparameter and $\alpha$ is a scaling factor. The typical forward computation for an adapted linear layer becomes:

$$y = Wx + \alpha\, B(Ax)$$

Only $B$ and $A$ are trained; $W$ remains frozen. LoRA-specific regularization involves standard weight decay on $\alpha\, BA$ when AdamW is used, with no special penalties applied to the LoRA parameters.
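
A minimal PyTorch sketch of this update rule, for illustration only (not the paper's implementation). Note that libraries such as PEFT usually scale the correction by $\alpha / r$ rather than by $\alpha$ directly; the sketch follows the formula above.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection W plus a trainable low-rank correction alpha * B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # W (and bias) stay frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # A in R^{r x d}
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B in R^{d x r}; zero init => ΔW = 0 at start
        self.alpha = alpha                                    # PEFT-style code often uses alpha / r here

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + alpha * B (A x); only A and B receive gradients.
        return self.base(x) + self.alpha * ((x @ self.A.T) @ self.B.T)

# Wrap, e.g., a 768-wide query projection and run a forward pass.
layer = LoRALinear(nn.Linear(768, 768), r=8, alpha=16.0)
y = layer(torch.randn(4, 768))  # (batch, d) -> (batch, d)
```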

Experimental sweeps evaluated $r \in \{4, 8, 10, 16, 32\}$ and $\alpha \in \{8, 16, 20, 32\}$.

3. Fine-tuning Procedures and Hyperparameters

The empirical search for optimal fine-tuning parameters yielded the following best configuration:

  • Optimizer: AdamW (no weight decay)
  • Learning Rate: $5 \times 10^{-6}$ (constant base rate with linear warmup and decay)
  • Batch Size: 6
  • Epochs: 10 (overfitting observed at $\geq 12$ epochs)
  • LoRA: rank $r = 8$, $\alpha = 16$, dropout 0.05
  • Trainable Parameters: $\sim$1.93M (0.71% of 272.7M total)
  • Hardware: NVIDIA L4 GPU (4 hours total fine-tuning time); additional experiments on T4 and A100
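
For concreteness, a hedged sketch of the optimizer and schedule implied by this configuration, continuing from the PEFT-wrapped model above. The warmup length and steps-per-epoch figure are assumptions for illustration, not values reported in the source.

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Only the LoRA matrices B and A require gradients after get_peft_model().
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=5e-6, weight_decay=0.0)  # best run: no weight decay

epochs = 10
steps_per_epoch = 1972 // 6   # assumed: ~70% of the 2,818 augmented images at batch size 6
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,     # assumed warmup length; not reported in the source
    num_training_steps=epochs * steps_per_epoch,
)
```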

Alternative configurations using SGD and weight decay were tested. Results consistently favored the AdamW optimizer without weight decay for convergence speed and generalization.

4. Datasets and Data Augmentation Strategies

Merged RGB datasets from two domains were utilized: PST900-RGB (subterranean, low-light, 894 images) and DARPA SubT facility (894 images), totaling 1,788 images. The object detection task involved five classes: Backpack, Cellphone, Drill, Fire Extinguisher, and Survivor.

Data splits:

  • Train: 70% (1,251 images)
  • Validation: 20% (357 images)
  • Test: 10% (178 images)
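
As a sketch, a reproducible 70/20/10 split of this kind can be produced with PyTorch's random_split; the placeholder dataset below stands in for the merged 1,788-image set.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Placeholder standing in for the 1,788 merged RGB images (indices only, no real data).
dataset = TensorDataset(torch.arange(1788))

train_set, val_set, test_set = random_split(
    dataset, [0.7, 0.2, 0.1], generator=torch.Generator().manual_seed(42)
)
print(len(train_set), len(val_set), len(test_set))  # approximately the reported 1,251 / 357 / 178
```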

Augmentation strategies, implemented via RoboFlow, expanded datasets to 2,818 (primary) and 3,981 (secondary) images while preserving original split ratios. Image preprocessing steps included resizing to $640 \times 640$, horizontal flipping, $\pm$90° rotation, $\pm$10° shearing, $\pm$25% saturation, and $\pm$10% exposure adjustments.
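
A hedged Albumentations approximation of that preprocessing pipeline. The probabilities and the mapping of saturation/exposure adjustments onto ColorJitter are assumptions; the exact settings live in the RoboFlow export.

```python
import albumentations as A
import numpy as np

transform = A.Compose(
    [
        A.Resize(640, 640),                                      # resize to 640 x 640
        A.HorizontalFlip(p=0.5),                                 # horizontal flipping
        A.Rotate(limit=90, p=0.5),                               # +/- 90 degree rotation
        A.Affine(shear=(-10, 10), p=0.5),                        # +/- 10 degree shearing
        A.ColorJitter(saturation=0.25, brightness=0.10, p=0.5),  # ~ +/-25% saturation, +/-10% exposure
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder RGB frame
augmented = transform(
    image=image,
    bboxes=[[34, 50, 200, 220]],                 # [x_min, y_min, x_max, y_max]
    labels=["Backpack"],
)
```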

5. Evaluation, Ablation Studies, and Comparative Performance

5.1 Detection Accuracy

The best LoRA-tuned Florence-2 configuration (L4 GPU, $r = 8$, $\alpha = 16$, lr $= 5\times10^{-6}$, AdamW, batch size 6, 10 epochs) achieved:

  • $\mathrm{mAP}_{50} = 0.80$
  • $\mathrm{mAP}_{75} = 0.57$
  • $\mathrm{mAP}_{50:95} = 0.56$
  • Training/Validation loss: 1.16 / 1.12
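
A hedged illustration of how such COCO-style metrics are computed, using torchmetrics as one common choice; the boxes, scores, and labels below are toy values, not the paper's outputs.

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision(box_format="xyxy", iou_type="bbox")

preds = [{
    "boxes": torch.tensor([[30.0, 45.0, 210.0, 230.0]]),  # toy prediction
    "scores": torch.tensor([0.91]),
    "labels": torch.tensor([0]),                           # e.g. class 0 = Backpack
}]
targets = [{
    "boxes": torch.tensor([[34.0, 50.0, 200.0, 220.0]]),   # toy ground truth
    "labels": torch.tensor([0]),
}]

metric.update(preds, targets)
results = metric.compute()
print(results["map_50"], results["map_75"], results["map"])  # mAP50, mAP75, mAP50:95
```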

5.2 Hyperparameter Ablation

| Optimizer | Learning Rate | $\mathrm{mAP}_{50}$ | $\mathrm{mAP}_{75}$ | $\mathrm{mAP}_{50:95}$ | Train Loss | Val Loss |
| --- | --- | --- | --- | --- | --- | --- |
| AdamW (wd = 0) | $5\times10^{-6}$ | 0.80 | 0.57 | 0.56 | 1.16 | 1.12 |
| SGD | $5\times10^{-6}$ | 0.60 | 0.44 | 0.42 | 1.41 | 1.29 |
| AdamW (wd = 0.01) | $1\times10^{-6}$ | 0.56 | 0.39 | 0.36 | 1.44 | 1.30 |
| SGD (momentum = 0.9) | $3\times10^{-6}$ | 0.74 | 0.52 | 0.50 | 1.41 | 1.32 |
| AdamW (wd = 0) | $3\times10^{-6}$ | 0.79 | 0.57 | 0.54 | 1.17 | 1.08 |

5.3 Comparison with YOLO Family

| Model | $\mathrm{mAP}_{50}$ | $\mathrm{mAP}_{50:95}$ |
| --- | --- | --- |
| YOLOv8 | 0.84 | 0.56 |
| YOLOv9 | 0.84 | 0.58 |
| YOLOv10 | 0.74 | 0.48 |

Florence-2 + LoRA ($\mathrm{mAP}_{50} = 0.80$, $\mathrm{mAP}_{50:95} = 0.56$) outperforms YOLOv10 and closely matches YOLOv8/v9, while maintaining additional capabilities such as captioning, Visual Question Answering (VQA), and segmentation.

6. Parameter and Compute Efficiency

  • Parameter Efficiency: Only $\sim$0.7% of Florence-2’s parameters ($\sim$1.93M / 272.7M) were updated during LoRA fine-tuning.
  • Memory and Training Speed: The approach yields a smaller optimizer state (16-bit precision for LoRA matrices only), allowing training to fit on modest GPUs (L4, T4) with total fine-tuning time around 4 hours.
  • Compute Cost: Full-model fine-tuning would typically require days; LoRA adaptation reduces this to hours.
  • Robustness: Training curves exhibit stability, with early stopping (10–12 epochs) effective at preventing overfitting.
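
A quick way to verify a trainable-parameter fraction like the reported 0.71%, in generic PyTorch; `model` is assumed to be the LoRA-wrapped network from the earlier sketches.

```python
def trainable_fraction(model) -> float:
    """Return the share of parameters that require gradients."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total

# For the configuration above this should print roughly 0.71% (~1.93M / 272.7M).
print(f"{trainable_fraction(model):.2%}")
```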

A plausible implication is that LoRA adaptation permits rapid, resource-efficient specialization of multi-modal transformers for complex, domain-specific tasks, without sacrificing their inherent multi-tasking capabilities.

7. Broader Context and Prospects

The integration of LoRA with Florence-2 demonstrates that transformer-based vision-language models can be efficiently adapted for challenging object detection scenarios in GPS-denied, cluttered real-world environments. Performance near the level of fully fine-tuned convolutional baselines is maintained while substantially reducing both the trainable parameter count and the compute expenditure. Additionally, the retained vision-language reasoning capabilities (e.g., captioning, VQA, segmentation) position LoRA-tuned Florence-2 models as flexible, scalable solutions for advanced applications in environments where traditional methods may falter. This suggests future work may extend LoRA-based adaptation to larger backbones, additional modalities, or more granular downstream tasks (Ucar et al., 6 Mar 2025).
