
DBFusion: Depth-Breadth Fusion in Multimodal Models

Updated 25 May 2026
  • DBFusion is a fusion strategy that integrates hierarchical (depth) and contextual (breadth) features to enhance model expressivity and precision.
  • It employs techniques such as channel integration, point-scattering, and cross-attention to effectively align diverse sensor inputs.
  • Empirical results in vision-language and BEV tasks confirm improved accuracy and robustness, validated through extensive ablation studies.

Depth-Breadth Fusion (DBFusion) refers to a family of architectural and algorithmic strategies for sensor or modality fusion in which features are integrated across multiple representational "depths" (network layers or modal latent hierarchies) and "breadths" (contextual prompts, sensor views, or modality-specific semantics). DBFusion has recently appeared in both multimodal large language models (Florence-VL; Chen et al., 2024) and bird's-eye-view (BEV) sensor fusion for autonomous driving (BroadBEV; Kim et al., 2023).

1. Theoretical Foundations: Depth and Breadth

DBFusion unifies two axes of feature diversity:

  • Depth captures the hierarchy of representations within a neural encoder. In Florence-VL, this involves extracting token embeddings from multiple layers of the Florence-2 DaViT backbone, where lower layers encode fine-grained, pixel-level patterns and higher layers encode semantic or relational abstractions. In BEV fusion (BroadBEV), depth refers to the reliable geometric “certainty” from LiDAR’s BEV maps versus the more ambiguous depth predictions from camera images.
  • Breadth introduces diversity by altering the context, prompt, or viewpoint. Florence-VL achieves breadth by processing an image under multiple prompts (e.g., detailed captioning, OCR, region-based grounding), extracting vision tokens specialized for each prompt. In BroadBEV, breadth corresponds to cross-modality: combining context-rich camera features and depth-focused LiDAR distributions.

This joint exploitation provides a richer, multi-granular embedding that improves downstream performance and cross-modal alignment.

2. Formal Framework and Fusion Mechanisms

Florence-VL’s DBFusion

Let $x$ denote an image input, let $d \in \{1, \dots, \mathcal{D}\}$ index DaViT backbone layers, and let $p \in \{1, \dots, \mathcal{P}\}$ index prompts.

  • Per-layer, per-prompt feature extraction:

$$f_{d,p} = \mathrm{Proj}_{d,p}\big(V^{(d)}(x;p)\big) \in \mathbb{R}^{N_v \times D'}$$

where $V^{(d)}(x;p)$ are the prompt-conditioned token embeddings and $\mathrm{Proj}_{d,p}$ is a small MLP projection.

  • Depth-Breadth fusion (channel concatenation):

$$F = [f_{1,1} \,\|\, f_{1,2} \,\|\cdots\|\, f_{\mathcal{D},\mathcal{P}}] \in \mathbb{R}^{N_v \times (D' \cdot \mathcal{D} \cdot \mathcal{P})}$$

  • Final projection to LLM space:

$$E = \mathrm{MLP}_{\mathrm{fuse}}(F) \in \mathbb{R}^{N_v \times d_L}$$

The embedding $E$ is prepended with a positional token and fed into the LLM.
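The three steps above can be illustrated with a minimal NumPy sketch. It is not the Florence-VL implementation: the sizes, the random stand-ins for $f_{d,p}$, the two-layer ReLU form of $\mathrm{MLP}_{\mathrm{fuse}}$, and the names `dbfusion` and `mlp` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
N_V, D_PRIME, D_L = 16, 8, 32    # tokens N_v, per-branch width D', LLM width d_L
NUM_DEPTHS, NUM_PROMPTS = 2, 3   # the counts written as script-D and script-P


def mlp(x, w1, w2):
    """Two-layer MLP with ReLU, applied to every token (assumed form)."""
    return np.maximum(x @ w1, 0.0) @ w2


def dbfusion(features, w1, w2):
    """Channel-concatenate per-(depth, prompt) features, then project.

    features: dict mapping (d, p) -> array of shape (N_v, D').
    Returns E with shape (N_v, d_L).
    """
    keys = sorted(features)  # fixed (d, p) order keeps the channel layout stable
    fused = np.concatenate([features[k] for k in keys], axis=-1)  # F
    return mlp(fused, w1, w2)  # E


# Random stand-ins for the projected features f_{d,p}.
features = {
    (d, p): rng.standard_normal((N_V, D_PRIME))
    for d in range(1, NUM_DEPTHS + 1)
    for p in range(1, NUM_PROMPTS + 1)
}
fused_width = D_PRIME * NUM_DEPTHS * NUM_PROMPTS
w1 = rng.standard_normal((fused_width, 64)) * 0.02
w2 = rng.standard_normal((64, D_L)) * 0.02

E = dbfusion(features, w1, w2)
print(E.shape)  # (16, 32)
```

The token count stays at $N_v$ regardless of how many depths and prompts are fused; only the channel width of $F$ grows.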

BroadBEV’s DBFusion (Editor’s term: BEV-DBFusion)

  • Input features:
    • Camera: $\mathbf{C}$ (context) and a depth distribution, both derived from a shared CNN backbone.
    • LiDAR: BEV features and depth.
  • Point-scattering (depth synchronization):
    • BEV pooling maps camera features/probabilities to BEV grid.
    • A two-branch CNN fuses the camera and LiDAR BEV depths into a synchronized depth.
    • Camera BEV features are re-weighted by this synchronized depth.

  • ColFusion (collaborative self-attention):

    • Self-attention is computed on each modality and cross-applied to the other.
    • The final BEV feature combines the two attended modality features.
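One possible reading of "self-attention computed on each modality and cross-applied to the other" is sketched below in NumPy. This is an assumption-laden illustration, not BroadBEV's ColFusion: the attention uses no learned query/key/value projections, the final combination is taken to be channel concatenation, and the function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention_map(x):
    """Scaled dot-product attention weights of x with itself, shape (N, N)."""
    scale = np.sqrt(x.shape[-1])
    return softmax(x @ x.T / scale, axis=-1)


def cross_applied_attention(cam, lidar):
    """Apply each modality's self-attention map to the other modality."""
    attn_cam = self_attention_map(cam)
    attn_lidar = self_attention_map(lidar)
    cam_out = attn_lidar @ cam     # LiDAR-derived weights mix camera cells
    lidar_out = attn_cam @ lidar   # camera-derived weights mix LiDAR cells
    # Assumed combination: channel concatenation of the two outputs.
    return np.concatenate([cam_out, lidar_out], axis=-1)


rng = np.random.default_rng(0)
cam_bev = rng.standard_normal((25, 8))    # 5x5 BEV grid flattened, 8 channels
lidar_bev = rng.standard_normal((25, 8))
fused = cross_applied_attention(cam_bev, lidar_bev)
print(fused.shape)  # (25, 16)
```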

3. DBFusion in Model Architectures

Florence-VL Pipeline Integration

DBFusion sits between the Florence-2 generative vision encoder and an LLM:

  1. Extract DaViT features at selected depths.

  2. For each prompt, propagate through encoder-decoder cross-attention to modulate vision tokens.

  3. Project each per-(depth, prompt) token stack to common embedding size.

  4. Concatenate along the channel dimension (channel integration).

  5. Fuse via MLP and align to LLM input dimension.

  6. Insert into LLM’s input stream as special vision tokens.

BroadBEV Sensor Fusion System

  1. Extract multi-view camera features and per-pixel depth distributions.

  2. Extract LiDAR BEV features and produce a BEV-aligned “depth heat map.”

  3. Perform BEV pooling to spatially align camera features and depths.

  4. Point-scattering synchronizes and fuses depth between modalities.

  5. ColFusion applies cross-modality attention between BEV features.

  6. A final FPN head decodes to semantic map or HD map output.
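Steps 3 and 4 can be pictured with the following NumPy sketch of depth synchronization and re-weighting. The learned two-branch CNN is replaced here by a fixed convex blend purely as a placeholder, and the function names, the `lidar_weight` parameter, and the grid sizes are hypothetical.

```python
import numpy as np


def synchronize_depth(cam_depth, lidar_depth, lidar_weight=0.5):
    """Blend two BEV depth maps; a placeholder for the learned two-branch CNN."""
    return lidar_weight * lidar_depth + (1.0 - lidar_weight) * cam_depth


def reweight_camera_bev(cam_feat, sync_depth):
    """Scale each BEV cell's camera feature vector by its synchronized depth."""
    return cam_feat * sync_depth[..., None]  # broadcast over the channel axis


rng = np.random.default_rng(0)
H, W, C = 4, 4, 8  # hypothetical BEV grid height, width, and channel count
cam_feat = rng.standard_normal((H, W, C))
cam_depth = rng.uniform(0.0, 1.0, size=(H, W))
lidar_depth = rng.uniform(0.0, 1.0, size=(H, W))

sync = synchronize_depth(cam_depth, lidar_depth)
weighted = reweight_camera_bev(cam_feat, sync)
print(weighted.shape)  # (4, 4, 8)
```

Cells where the synchronized depth is close to zero contribute little camera evidence to the fused map, which is the intuition behind re-weighting.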

4. Training Paradigms and Losses

Florence-VL DBFusion

  • Stage A: Whole-model pretraining on 16.9M image-caption pairs with the entire stack (vision, fusion, LLM) updated. Optimized for caption likelihood under standard cross-entropy.

  • Stage B: Instruction tuning on ~10M vision-language instruction samples. Only projection heads, fusion MLP, and LLM are updated. Florence-2 encoder and lower projections are frozen.

  • Hyperparameters: batch sizes up to 4096, cosine-decay learning rates, 3–5 epochs until validation perplexity convergence.
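For reference, a cosine-decay learning-rate schedule can be written in a few lines. This is the standard cosine annealing formula without warmup; the endpoint values `LR_MAX` and `LR_MIN` and the step count are placeholders, not the values used for Florence-VL.

```python
import math


def cosine_decay_lr(step, total_steps, lr_max, lr_min):
    """Cosine-annealed learning rate from lr_max (step 0) to lr_min (last step)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Placeholder endpoints and schedule length, chosen only for illustration.
LR_MAX, LR_MIN, TOTAL = 1e-4, 1e-6, 1000
for step in (0, TOTAL // 2, TOTAL):
    print(step, cosine_decay_lr(step, TOTAL, LR_MAX, LR_MIN))
```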

BroadBEV DBFusion

  • Supervision terms:

    • Depth losses: applied to per-pixel and BEV depths, with cross-entropy as one option.
    • Semantic segmentation losses: applied to per-cell class predictions.
    • Combined objective: the depth and segmentation terms are combined into a single training loss.
  • Training: AdamW optimizer with weight decay, 20 epochs, 3-head multi-head attention, extensive data augmentation.
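A combined objective of this kind can be sketched as a weighted sum of a depth term and a segmentation term. The sketch assumes both terms are cross-entropy over discrete bins or classes and that they are summed with scalar weights `w_depth` and `w_seg`; those weights, the bin counts, and the function names are hypothetical.

```python
import numpy as np


def cross_entropy(logits, targets):
    """Mean cross-entropy; logits has shape (N, K), integer targets shape (N,)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


def combined_loss(depth_logits, depth_bins, seg_logits, seg_labels,
                  w_depth=1.0, w_seg=1.0):
    """Weighted sum of a depth term and a segmentation term (weights assumed)."""
    return (w_depth * cross_entropy(depth_logits, depth_bins)
            + w_seg * cross_entropy(seg_logits, seg_labels))


rng = np.random.default_rng(0)
n_cells, n_depth_bins, n_classes = 64, 10, 6  # hypothetical sizes
loss = combined_loss(
    rng.standard_normal((n_cells, n_depth_bins)),
    rng.integers(0, n_depth_bins, size=n_cells),
    rng.standard_normal((n_cells, n_classes)),
    rng.integers(0, n_classes, size=n_cells),
)
print(float(loss))
```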

5. Empirical Analysis, Ablations, and Visualization

Florence-VL Experimental Findings

| Fusion Strategy | Score on LLaVA-1.5 | Efficiency |
|---|---|---|
| Token Integration | 50.4 | Low |
| Avg-pooling | 50.3 | Moderate |
| Channel Integration | 50.8 | High |

  • Channel integration achieves the best accuracy and efficiency.
  • Depth ablation: removing higher-level features degrades performance, e.g., on OCRBench, score drops from 41.4 to 31.2.
  • Breadth ablation: omitting prompt-specific features (detailed caption, OCR, or grounding) results in mean score drops of 0.1–1.0 points per task group.
  • Feature visualizations via PCA: orthogonality of prompt-specific features (scene, text, region) is demonstrated, with Florence-VL's features more tightly aligned to LLM space versus CLIP ViT-L/14.
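The shape consequences of the three fusion strategies compared above can be seen in a short NumPy sketch. Channel integration as channel-dimension concatenation follows the pipeline description; reading "token integration" as concatenation along the token axis and "avg-pooling" as an element-wise mean over branches is an assumption, as are the sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
N_V, D_PRIME, NUM_BRANCHES = 16, 8, 3  # hypothetical sizes
branches = [rng.standard_normal((N_V, D_PRIME)) for _ in range(NUM_BRANCHES)]

# Assumed reading: stack branches along the token axis (longer sequence).
token_integration = np.concatenate(branches, axis=0)
# Assumed reading: element-wise mean over branches (shape of a single branch).
avg_pooling = np.mean(np.stack(branches), axis=0)
# Concatenate along the channel axis (same token count, wider tokens).
channel_integration = np.concatenate(branches, axis=-1)

print(token_integration.shape)    # (48, 8)
print(avg_pooling.shape)          # (16, 8)
print(channel_integration.shape)  # (16, 24)
```

Under these readings, only token integration lengthens the sequence handed to the LLM, which is one plausible explanation for its "Low" efficiency rating.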

BroadBEV Experimental Findings

  • State-of-the-art BEV mIoU on nuScenes: 70.1% (vs. 67.8% UniM²AE, 65.7% X-Align) for map segmentation.
  • Robustness: Rain 63.7% (vs. 57.8%), Night 50.8% (vs. 46.1%) mIoU.
  • HD-Map construction: BroadBEV achieves 64.0% mIoU, outperforming LiDAR2Map (58.1%).
  • Ablations: point-scattering alone or ColFusion alone yields intermediate improvements (63.9% and 69.1% mIoU, respectively), while their combination achieves the highest mIoU.
  • Qualitative analyses: point-scattering yields clearer distant feature recovery; ColFusion improves structural details via cross-modal attention.
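The mIoU figures above are mean intersection-over-union scores. A minimal sketch of that metric for binary per-class BEV masks follows; the benchmark's exact protocol (thresholds, class list, handling of empty classes) is not specified here, so skipping classes with an empty union is an assumption.

```python
import numpy as np


def mean_iou(pred, target):
    """Mean IoU over classes; pred and target are boolean (num_classes, H, W)."""
    ious = []
    for p, t in zip(pred, target):
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent in both prediction and target; skipped
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")


pred = np.array([[[1, 1], [0, 0]],
                 [[0, 0], [1, 1]]], dtype=bool)
target = np.array([[[1, 0], [0, 0]],
                   [[0, 0], [1, 1]]], dtype=bool)
print(mean_iou(pred, target))  # 0.75
```

Class 0 has intersection 1 and union 2 (IoU 0.5), class 1 matches exactly (IoU 1.0), so the mean is 0.75.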

6. Comparative Assessment and Significance

DBFusion improves cross-modal alignment and expressivity by leveraging both depth (multi-layer representations or reliable geometric structure) and breadth (prompt-induced, modality-diverse, or cross-attentive representations). In Florence-VL, DBFusion yields the lowest alignment loss among vision encoders and substantially outperforms prior MLLMs across vision-language understanding, OCR, chart reasoning, and knowledge-intensive benchmarks. In BEV perception, BroadBEV’s DBFusion enables denser, broader scene coverage and robustness under adverse conditions, outperforming both camera-only and LiDAR-only approaches.

Ablation studies confirm the necessity of integrating both axes; for both Florence-VL and BroadBEV, removal of either dimension leads to measurable drops in downstream accuracy and alignment scores.

7. Contextual Relevance and Future Outlook

DBFusion exemplifies a generalizable pattern for multimodal and cross-sensor fusion: multi-level, context-diverse feature aggregation followed by compact projection and integration. Its design underpins advances in vision-LLMs and BEV perception, providing a template for future systems seeking both semantic richness and geometric precision.

Key research groups, including the Florence-VL development team (Microsoft Research) and the BroadBEV contributors, have open-sourced implementations and detailed recipes, facilitating further adoption and benchmarking. Likely future directions include attention-based fusion weighting, generalization to dynamic prompt sets, and extension to other domains where hierarchical and context-diverse fusion is critical (Chen et al., 2024; Kim et al., 2023).
