LoRA-tuned Florence-2 Model

Updated 11 November 2025
  • The paper introduces a dual-encoder transformer architecture integrated with LoRA to achieve parameter-efficient fine-tuning and near state-of-the-art detection accuracy.
  • It demonstrates that optimizing only 0.71% of the total parameters reduces memory and compute requirements while retaining robust multi-modal capabilities.
  • Empirical evaluations on RGB datasets show that the LoRA-tuned Florence-2 model rivals YOLO variants, offering stable training with applications like VQA and segmentation.

The LoRA-tuned Florence-2 model is a vision-language transformer system adapted for object detection in unstructured, cluttered environments through the integration of Low-Rank Adaptation (LoRA). Florence-2 leverages a dual-encoder transformer design, combining large-scale visual and textual information processing. The application of LoRA enables parameter-efficient fine-tuning, facilitating tractable training on resource-constrained hardware while achieving near state-of-the-art detection accuracy.

1. Florence-2 Base Architecture and LoRA Integration

Florence-2-base is composed of approximately 270 million parameters and features a dual-encoder, transformer-based architecture tailored for multi-modal vision-language tasks. The visual encoder employs the DaViT (Dual-attention Vision Transformer), producing hierarchical patch embeddings across four stages. Textual information is processed through a BERT-style encoder, accommodating both text tokens and specialized tokens for object detection and bounding box coordinates (e.g., <OD>, x0, y0, x1, y1).
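
For orientation, the publicly released Hugging Face checkpoint of Florence-2 exposes this object-detection interface through the <OD> task prompt. The sketch below follows that documented processor/generate usage and is not code from the paper; the image path is a placeholder.

```python
# Hedged sketch: object detection with the public Florence-2-base checkpoint
# via its <OD> task prompt. The image path is a placeholder.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True
)

image = Image.open("example.jpg")  # placeholder RGB image
inputs = processor(text="<OD>", images=image, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )

text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# Converts the generated token sequence (task and location tokens) into boxes/labels.
detections = processor.post_process_generation(
    text, task="<OD>", image_size=(image.width, image.height)
)
print(detections)  # {'<OD>': {'bboxes': [...], 'labels': [...]}}
```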

The backbone consists of stacked transformer encoder-decoder layers that fuse visual and textual streams to produce final bounding box and class predictions. LoRA adaptation modules are injected into each block's core linear projection layers: the query ($W_q$), key ($W_k$), and value ($W_v$) projections and the initial feed-forward layer ($W_1$).

During LoRA fine-tuning, all original Florence-2 weights remain frozen. Only the parameters of the LoRA adapters (matrices $B$ and $A$) within the specified layers are optimized. This enables highly parameter-efficient adaptation and reduces memory and compute requirements during training.
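
A hedged sketch of this adapter injection using Hugging Face PEFT, continuing from the model loaded above. The target module names (q_proj, k_proj, v_proj, fc1) are assumptions standing in for the paper's $W_q$, $W_k$, $W_v$, and $W_1$ layers and would need to match the checkpoint's actual layer names.

```python
from peft import LoraConfig, get_peft_model

# Module names are assumptions standing in for W_q, W_k, W_v, and W_1;
# adjust them to the layer names actually used by the checkpoint.
lora_config = LoraConfig(
    r=8,                # LoRA rank from the best-performing configuration
    lora_alpha=16,      # scaling factor alpha
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "fc1"],
    bias="none",
)

model = get_peft_model(model, lora_config)  # freezes all base weights automatically
model.print_trainable_parameters()          # expect roughly 0.7% trainable
```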

2. LoRA Mathematical Formulation

In conventional transformer fine-tuning, the full weight matrix $W \in \mathbb{R}^{d \times d}$ of each linear transformation is updated. LoRA replaces full-rank optimization with a low-rank correction term:

$$\Delta W = BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times d},\; r \ll d$$

The updated weight at training time is:

$$W' = W + \alpha\, \Delta W = W + \alpha\, BA$$

where $r$ is the LoRA rank hyperparameter and $\alpha$ is a scaling factor. The typical forward computation for an adapted linear layer becomes:

$$y = Wx + \alpha\, B(Ax)$$

Only $B$ and $A$ are trained; $W$ remains frozen. LoRA-specific regularization involves standard weight decay on $\alpha\, BA$ when AdamW is used, with no special penalties applied to the LoRA parameters.
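
A minimal PyTorch sketch of this update rule, for illustration only (not the paper's implementation). Note that libraries such as PEFT usually scale the correction by $\alpha / r$ rather than by $\alpha$ directly; the sketch follows the formula above.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection W plus a trainable low-rank correction alpha * B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # W (and bias) stay frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # A in R^{r x d}
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B in R^{d x r}; zero init => ΔW = 0 at start
        self.alpha = alpha                                    # PEFT-style code often uses alpha / r here

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + alpha * B (A x); only A and B receive gradients.
        return self.base(x) + self.alpha * ((x @ self.A.T) @ self.B.T)

# Wrap, e.g., a 768-wide query projection and run a forward pass.
layer = LoRALinear(nn.Linear(768, 768), r=8, alpha=16.0)
y = layer(torch.randn(4, 768))  # (batch, d) -> (batch, d)
```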

Experimental sweeps evaluated $r \in \{4, 8, 10, 16, 32\}$ and $\alpha \in \{8, 16, 20, 32\}$.

3. Fine-tuning Procedures and Hyperparameters

The empirical search for optimal fine-tuning parameters yielded the following best configuration:

  • Optimizer: AdamW (no weight decay)
  • Learning Rate: $5 \times 10^{-6}$ (constant base rate with linear warmup and decay)
  • Batch Size: 6
  • Epochs: 10 (overfitting observed at $\geq 12$ epochs)
  • LoRA: rank $r = 8$, $\alpha = 16$, dropout 0.05
  • Trainable Parameters: $\sim$1.93M (0.71% of 272.7M total)
  • Hardware: NVIDIA L4 GPU (4 hours total fine-tuning time); additional experiments on T4 and A100
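
For concreteness, a hedged sketch of the optimizer and schedule implied by this configuration, continuing from the PEFT-wrapped model above. The warmup length and steps-per-epoch figure are assumptions for illustration, not values reported in the source.

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Only the LoRA matrices B and A require gradients after get_peft_model().
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=5e-6, weight_decay=0.0)  # best run: no weight decay

epochs = 10
steps_per_epoch = 1972 // 6   # assumed: ~70% of the 2,818 augmented images at batch size 6
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,     # assumed warmup length; not reported in the source
    num_training_steps=epochs * steps_per_epoch,
)
```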

Alternative configurations using SGD and weight decay were tested. Results consistently favored the AdamW optimizer without weight decay for convergence speed and generalization.

4. Datasets and Data Augmentation Strategies

Merged RGB datasets from two domains were utilized: PST900-RGB (subterranean, low-light, 894 images) and DARPA SubT facility (894 images), totaling 1,788 images. The object detection task involved five classes: Backpack, Cellphone, Drill, Fire Extinguisher, and Survivor.

Data splits:

  • Train: 70% (1,251 images)
  • Validation: 20% (357 images)
  • Test: 10% (178 images)
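
As a sketch, a reproducible 70/20/10 split of this kind can be produced with PyTorch's random_split; the placeholder dataset below stands in for the merged 1,788-image set.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Placeholder standing in for the 1,788 merged RGB images (indices only, no real data).
dataset = TensorDataset(torch.arange(1788))

train_set, val_set, test_set = random_split(
    dataset, [0.7, 0.2, 0.1], generator=torch.Generator().manual_seed(42)
)
print(len(train_set), len(val_set), len(test_set))  # approximately the reported 1,251 / 357 / 178
```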

Augmentation strategies, implemented via RoboFlow, expanded datasets to 2,818 (primary) and 3,981 (secondary) images while preserving original split ratios. Image preprocessing steps included resizing to $640 \times 640$, horizontal flipping, $\pm$90° rotation, $\pm$10° shearing, $\pm$25% saturation, and $\pm$10% exposure adjustments.
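
A hedged Albumentations approximation of that preprocessing pipeline. The probabilities and the mapping of saturation/exposure adjustments onto ColorJitter are assumptions; the exact settings live in the RoboFlow export.

```python
import albumentations as A
import numpy as np

transform = A.Compose(
    [
        A.Resize(640, 640),                                      # resize to 640 x 640
        A.HorizontalFlip(p=0.5),                                 # horizontal flipping
        A.Rotate(limit=90, p=0.5),                               # +/- 90 degree rotation
        A.Affine(shear=(-10, 10), p=0.5),                        # +/- 10 degree shearing
        A.ColorJitter(saturation=0.25, brightness=0.10, p=0.5),  # ~ +/-25% saturation, +/-10% exposure
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder RGB frame
augmented = transform(
    image=image,
    bboxes=[[34, 50, 200, 220]],                 # [x_min, y_min, x_max, y_max]
    labels=["Backpack"],
)
```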

5. Evaluation, Ablation Studies, and Comparative Performance

5.1 Detection Accuracy

The best LoRA-tuned Florence-2 configuration (L4 GPU, $r = 8$, $\alpha = 16$, lr $= 5\times10^{-6}$, AdamW, batch size 6, 10 epochs) achieved:

  • $\mathrm{mAP}_{50} = 0.80$
  • $\mathrm{mAP}_{75} = 0.57$
  • $\mathrm{mAP}_{50:95} = 0.56$
  • Training/Validation loss: 1.16 / 1.12
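
A hedged illustration of how such COCO-style metrics are computed, using torchmetrics as one common choice; the boxes, scores, and labels below are toy values, not the paper's outputs.

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision(box_format="xyxy", iou_type="bbox")

preds = [{
    "boxes": torch.tensor([[30.0, 45.0, 210.0, 230.0]]),  # toy prediction
    "scores": torch.tensor([0.91]),
    "labels": torch.tensor([0]),                           # e.g. class 0 = Backpack
}]
targets = [{
    "boxes": torch.tensor([[34.0, 50.0, 200.0, 220.0]]),   # toy ground truth
    "labels": torch.tensor([0]),
}]

metric.update(preds, targets)
results = metric.compute()
print(results["map_50"], results["map_75"], results["map"])  # mAP50, mAP75, mAP50:95
```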

5.2 Hyperparameter Ablation

| Optimizer | Learning Rate | $\mathrm{mAP}_{50}$ | $\mathrm{mAP}_{75}$ | $\mathrm{mAP}_{50:95}$ | Train Loss | Val Loss |
| --- | --- | --- | --- | --- | --- | --- |
| AdamW (wd = 0) | $5\times10^{-6}$ | 0.80 | 0.57 | 0.56 | 1.16 | 1.12 |
| SGD | $5\times10^{-6}$ | 0.60 | 0.44 | 0.42 | 1.41 | 1.29 |
| AdamW (wd = 0.01) | $1\times10^{-6}$ | 0.56 | 0.39 | 0.36 | 1.44 | 1.30 |
| SGD (momentum = 0.9) | $3\times10^{-6}$ | 0.74 | 0.52 | 0.50 | 1.41 | 1.32 |
| AdamW (wd = 0) | $3\times10^{-6}$ | 0.79 | 0.57 | 0.54 | 1.17 | 1.08 |

5.3 Comparison with YOLO Family

| Model | $\mathrm{mAP}_{50}$ | $\mathrm{mAP}_{50:95}$ |
| --- | --- | --- |
| YOLOv8 | 0.84 | 0.56 |
| YOLOv9 | 0.84 | 0.58 |
| YOLOv10 | 0.74 | 0.48 |

Florence-2 + LoRA ($\mathrm{mAP}_{50} = 0.80$, $\mathrm{mAP}_{50:95} = 0.56$) outperforms YOLOv10 and closely matches YOLOv8/v9, while maintaining additional capabilities such as captioning, Visual Question Answering (VQA), and segmentation.

6. Parameter and Compute Efficiency

  • Parameter Efficiency: Only $\sim$0.7% of Florence-2’s parameters ($\sim$1.93M / 272.7M) were updated during LoRA fine-tuning.
  • Memory and Training Speed: The approach yields a smaller optimizer state (16-bit precision for LoRA matrices only), allowing training to fit on modest GPUs (L4, T4) with total fine-tuning time around 4 hours.
  • Compute Cost: Full-model fine-tuning would typically require days; LoRA adaptation reduces this to hours.
  • Robustness: Training curves exhibit stability, with early stopping (10–12 epochs) effective at preventing overfitting.
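
A quick way to verify a trainable-parameter fraction like the reported 0.71%, in generic PyTorch; `model` is assumed to be the LoRA-wrapped network from the earlier sketches.

```python
def trainable_fraction(model) -> float:
    """Return the share of parameters that require gradients."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total

# For the configuration above this should print roughly 0.71% (~1.93M / 272.7M).
print(f"{trainable_fraction(model):.2%}")
```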

A plausible implication is that LoRA adaptation permits rapid, resource-efficient specialization of multi-modal transformers for complex, domain-specific tasks, without sacrificing their inherent multi-tasking capabilities.

7. Broader Context and Prospects

The integration of LoRA with Florence-2 demonstrates that transformer-based vision-language models can be efficiently adapted for challenging object detection scenarios in GPS-denied, cluttered real-world environments. Performance near the level of fully fine-tuned convolutional baselines is maintained while substantially reducing both the trainable parameter count and the compute expenditure. Additionally, the retained vision-language reasoning capabilities (e.g., captioning, VQA, segmentation) position LoRA-tuned Florence-2 models as flexible, scalable solutions for advanced applications in environments where traditional methods may falter. This suggests future work may extend LoRA-based adaptation to larger backbones, additional modalities, or more granular downstream tasks (Ucar et al., 6 Mar 2025).
