
MFB & MFH: Efficient Multimodal Fusion

Updated 23 April 2026
  • MFB is a multimodal fusion approach that factorizes full bilinear pooling into low-rank matrices, enabling efficient capture of second-order feature interactions.
  • MFH extends MFB by cascading multiple fusion blocks to extract higher-order feature correlations, leading to measurable gains in VQA accuracy.
  • Both methods balance computational efficiency and expressive power, underpinning state-of-the-art VQA architectures with robust normalization and dropout techniques.

Multimodal Factorized Bilinear (MFB) and its extension Multimodal Factorized High-order pooling (MFH) are families of efficient and expressive fusion operators designed for multimodal representation learning, particularly in the context of Visual Question Answering (VQA). They generalize bilinear pooling approaches by decomposing the parameter-intensive full bilinear tensor into low-rank factors, enabling tractable computation and higher-order feature interactions while retaining discriminative power. MFB and MFH have demonstrated significant empirical gains over previous multimodal fusion strategies on large-scale VQA tasks (Yu et al., 2017).

1. Mathematical Foundations

Given a visual feature $x \in \mathbb{R}^m$ and a textual feature $y \in \mathbb{R}^n$, classical bilinear pooling computes, for $o$ output dimensions,

$$z_i = x^\top W_i y, \qquad W_i \in \mathbb{R}^{m \times n}, \qquad i = 1, \dots, o,$$

aggregated across an order-three tensor $W \in \mathbb{R}^{m \times n \times o}$, which is prohibitive for large $m, n, o$. MFB mitigates this by factorizing each $W_i$ as $W_i = U_i V_i^\top$ with $U_i \in \mathbb{R}^{m \times k}$, $V_i \in \mathbb{R}^{n \times k}$, and rank $k$:

$$z_i = x^\top U_i V_i^\top y = \mathbb{1}^\top \left( U_i^\top x \circ V_i^\top y \right),$$

where $\circ$ denotes elementwise multiplication and $\mathbb{1}$ is the all-ones vector in $\mathbb{R}^k$.
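The factorization identity can be checked numerically. A minimal NumPy sketch with arbitrary toy dimensions (all sizes here are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 6, 5, 3  # feature dims and factor rank (illustrative)

x = rng.standard_normal(m)       # visual feature
y = rng.standard_normal(n)       # textual feature
U = rng.standard_normal((m, k))  # low-rank factor U_i
V = rng.standard_normal((n, k))  # low-rank factor V_i

# Full bilinear form with the rank-k matrix W_i = U_i V_i^T
z_full = x @ (U @ V.T) @ y

# Factorized form: sum of the k elementwise products
z_fact = np.sum((U.T @ x) * (V.T @ y))

assert np.allclose(z_full, z_fact)
```

Both sides agree to floating-point precision, confirming that the projection-and-multiply form never needs to materialize $W_i$.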

Stacking all factors yields $U \in \mathbb{R}^{m \times ko}$, $V \in \mathbb{R}^{n \times ko}$, and the expanded fusion vector

$$\tilde{z} = U^\top x \circ V^\top y \in \mathbb{R}^{ko},$$

which is sum-pooled in non-overlapping windows of size $k$ to yield $z \in \mathbb{R}^{o}$:

$$z = \mathrm{SumPool}(\tilde{z}, k).$$

Two normalizations are typically applied afterwards: power normalization ($z \leftarrow \mathrm{sign}(z)\sqrt{|z|}$) followed by $\ell_2$ normalization ($z \leftarrow z / \lVert z \rVert_2$).
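Putting the pieces together, a hedged NumPy sketch of one MFB forward pass (projection, elementwise fusion, window sum-pooling, power and $\ell_2$ normalization). The function name `mfb` and all dimensions are illustrative, and dropout is omitted:

```python
import numpy as np

def mfb(x, y, U, V, k):
    """MFB fusion: z = SumPool(U^T x . V^T y, k), then power + L2 normalization."""
    z_tilde = (U.T @ x) * (V.T @ y)         # expanded vector in R^{k*o}
    z = z_tilde.reshape(-1, k).sum(axis=1)  # sum-pool windows of size k -> R^o
    z = np.sign(z) * np.sqrt(np.abs(z))     # power normalization
    return z / (np.linalg.norm(z) + 1e-12)  # L2 normalization

rng = np.random.default_rng(1)
m, n, k, o = 8, 7, 5, 4
x, y = rng.standard_normal(m), rng.standard_normal(n)
U = rng.standard_normal((m, k * o))
V = rng.standard_normal((n, k * o))

z = mfb(x, y, U, V, k)
print(z.shape)  # (4,) -- o output dimensions, unit L2 norm
```

The reshape-then-reduce step is the entire sum-pooling operation; no convolution machinery is needed.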

MFH further increases representational capacity by cascading $p$ MFB “blocks”: at block $i$, the intermediate (pre-pooling) vector is recursively multiplied element-wise, $\tilde{z}_i = \tilde{z}_{i-1} \circ \left( U_i^\top x \circ V_i^\top y \right)$ with $\tilde{z}_0 = \mathbb{1}$, producing progressively higher-order feature products. The output of each block is sum-pooled and normalized, and the results are concatenated to form the final $z \in \mathbb{R}^{po}$.
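The cascade can be sketched directly on top of the MFB machinery. In this illustrative NumPy sketch (`mfh` is not an official API, and per-block dropout is again omitted), each block reuses the previous block's pre-pooling vector:

```python
import numpy as np

def normalize(z):
    z = np.sign(z) * np.sqrt(np.abs(z))     # power normalization
    return z / (np.linalg.norm(z) + 1e-12)  # L2 normalization

def mfh(x, y, Us, Vs, k):
    """Cascade of p MFB blocks; concatenates the p normalized outputs."""
    z_prev = 1.0  # z~_0 = all-ones (scalar broadcast)
    outputs = []
    for U, V in zip(Us, Vs):
        z_prev = z_prev * (U.T @ x) * (V.T @ y)  # recursive elementwise product
        z = z_prev.reshape(-1, k).sum(axis=1)    # sum-pool this block
        outputs.append(normalize(z))
    return np.concatenate(outputs)               # final vector in R^{p*o}

rng = np.random.default_rng(2)
m, n, k, o, p = 8, 7, 5, 4, 2
x, y = rng.standard_normal(m), rng.standard_normal(n)
Us = [rng.standard_normal((m, k * o)) for _ in range(p)]
Vs = [rng.standard_normal((n, k * o)) for _ in range(p)]

z = mfh(x, y, Us, Vs, k)
print(z.shape)  # (8,) -- p * o
```

Setting `p = 1` recovers plain MFB, which matches the equivalence stated in the comparison below.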

2. Relationship to MLB, MCB, and Expressivity

MFB and MFH are part of a continuum of multimodal fusion schemes:

  • Bilinear Pooling: $\mathcal{O}(mno)$ parameters; not practical for high-dimensional features.
  • MCB: Uses random projections (count sketch and circular convolution) to approximate outer products in high-dimensional space [performance: 59.8% (16K-D)].
  • MLB: Sets $k=1$ with no sum-pooling; $z = U^\top x \circ V^\top y$, i.e. rank-1 factors with first-order interactions and limited expressivity [performance: 59.7% (1K-D)].
  • MFB: Equivalent to MFH with $p=1$; uses $k=5$ and sum-pooling for second-order correlations, with $\mathcal{O}(k(m+n)o)$ parameters [60.9% (1K-D)].
  • MFH: Cascades MFB blocks ($p \geq 2$) for higher-order feature products, with output size growing linearly in $p$ [MFH ($p=2$): 61.6%; MFH ($p=3$): 61.5%].

Empirical results show progressive gains with higher-order pooling: MFH with $p=2$ improves on MFB by +0.7%, and performance saturates at $p=3$.
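The parameter counts in the comparison above can be made concrete. With typical VQA feature sizes from the article ($m=2048$, $n=1024$, $o=1000$, $k=5$), the factorized form is over two orders of magnitude smaller than full bilinear pooling:

```python
m, n, o, k = 2048, 1024, 1000, 5

full_bilinear = m * n * o     # O(mno): one m x n matrix per output dimension
mfb_params = k * o * (m + n)  # O(ko(m+n)): stacked factors U and V

print(f"full bilinear: {full_bilinear:,}")  # ~2.1e9 parameters
print(f"MFB:           {mfb_params:,}")     # ~1.5e7 parameters
print(f"ratio:         {full_bilinear / mfb_params:.0f}x")
```

MCB avoids the full tensor differently, via sketching into a very high-dimensional (16K-D) space, which is why its output dimension is so much larger than MFB's.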

3. Integration into VQA Architectures

MFB and MFH underpin several VQA architectures. The baseline no-attention pipeline extracts a 2048-D image vector from ResNet-152 and a 1024-D question vector from an LSTM; fusion is performed via MFB or MFH, followed by a softmax classifier over the candidate answer classes (Yu et al., 2017).

A co-attention variant employs grid-level image features (a spatial grid of ResNet activations, one 2048-D vector per cell) and word-level question features (one LSTM output per question word). Question self-attention yields an attended question representation; image attention, guided by that representation, yields an attended visual feature; and fusion then proceeds via MFH. Attention maps and fusion blocks may be parallelized (“glimpses”).
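A minimal sketch of the question-guided image attention step for a single glimpse. This uses a plain bilinear score as a stand-in for the paper's MFB-based scoring, and all names and dimensions are illustrative:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())  # shift for numerical stability
    return e / e.sum()

def attend_image(grid_feats, q_attended, Wg, Wq):
    """One attention glimpse: score each grid cell against the question."""
    scores = (grid_feats @ Wg) @ (Wq @ q_attended)  # (G,) relevance scores
    alpha = softmax(scores)                         # attention map over cells
    return alpha @ grid_feats                       # weighted sum -> attended visual feature

rng = np.random.default_rng(3)
G, dv, dq, d = 14 * 14, 32, 16, 8  # grid cells and toy feature dims
grid = rng.standard_normal((G, dv))
q = rng.standard_normal(dq)
Wg = rng.standard_normal((dv, d))
Wq = rng.standard_normal((d, dq))

v_att = attend_image(grid, q, Wg, Wq)
print(v_att.shape)  # (32,)
```

Multiple glimpses simply run this with independent weight matrices and concatenate the resulting attended vectors.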

4. Training Methods and Regularization

Training employs a KL-divergence loss between the ground-truth answer distribution $a$ and the predicted distribution $\hat{a}$:

$$\mathcal{L} = \mathrm{KL}(a \,\|\, \hat{a}) = \sum_{j} a_j \log \frac{a_j}{\hat{a}_j},$$

with the Adam optimizer (initial learning rate $7 \times 10^{-4}$), batch size 200 (no attention) or 64 (attention), dropout after the LSTM and after each MFB expansion stage, and feature normalization. Typical hyperparameters: $k=5$ and $o=1000$ for each MFB block, and $p=2$ for MFH (output dimension 2000).
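Because VQA ground truth comes from multiple annotators, $a$ is a soft distribution rather than a one-hot label. A hedged NumPy sketch of the loss (clipping by `eps` is an implementation convenience assumed here, not prescribed by the source):

```python
import numpy as np

def kl_loss(a, a_hat, eps=1e-12):
    """KL(a || a_hat) between answer distributions; a is the ground truth."""
    a = np.clip(a, eps, 1.0)          # avoid log(0) for zero-probability answers
    a_hat = np.clip(a_hat, eps, 1.0)
    return np.sum(a * np.log(a / a_hat))

# Soft ground truth: e.g. 7 of 10 annotators chose class 0, 3 chose class 2
a = np.array([0.7, 0.0, 0.3])
a_hat = np.array([0.6, 0.1, 0.3])  # model's softmax output

loss = kl_loss(a, a_hat)
print(loss)
```

Since $a$ is fixed, minimizing this KL term is equivalent (up to a constant) to minimizing soft-label cross-entropy.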

Word embeddings are learned and optionally concatenated with 300-D pretrained GloVe vectors.

5. Empirical Results and Ablation Studies

Empirical evaluation on VQA-1.0 open-ended test-dev with single models:

  • MFB ($p=1$): 60.9% accuracy.
  • MFH ($p=2$): 61.6% (+0.7 over MFB).
  • MFH ($p=3$): 61.5% (performance saturates).
  • Attention variants: MFB+Att 64.6%, MFH+Att 65.3%.
  • Co-attention: MFH+CoAtt 65.8%; +GloVe 66.8%; +VG data 67.7%.
  • Ensemble (7 models) MFH+CoAtt+GloVe: 69.2% (test-std), outperforming ensembles of MLB (66.9%) and MCB (66.5%).
  • VQA-2.0 test-dev: single MFH+CoAtt+GloVe: 65.80%; ensemble (9 models): 68.02% (2nd place in VQA Challenge 2017).

Ablation studies highlight the importance of normalization (without it, MFB drops by roughly 3 points), the superiority of the KL loss over answer sampling (+0.2–0.3%, faster convergence), and the necessity of higher-order pooling (an improvement of roughly +0.7% from $p=1$ to $p=2$).

6. Implementation and Computational Characteristics

MFH is efficiently realizable within modern deep learning frameworks. Pseudocode in Yu et al. (2017) prescribes, for each MFH block: two linear projections (for $x$ and $y$), elementwise multiplication, dropout, recursive elementwise multiplication with the previous block's intermediate vector for higher-order interactions, sum-pooling, and normalization. The output vector grows linearly with the number of blocks $p$; the parameter overhead is $\mathcal{O}(k(m+n)o)$ per block.

In practice, pooling, dropout, and normalization are implemented as grouped operations and pointwise transformations, ensuring compatibility with PyTorch or TensorFlow. This architecture enables scalable, expressive, and tractable multimodal fusion applicable to large-scale VQA and related multimodal tasks.
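For instance, the window sum-pooling is just a reshape plus a reduction, which is what makes it cheap in any framework. A toy NumPy example with $k=3$, $o=4$:

```python
import numpy as np

k, o = 3, 4
z_tilde = np.arange(k * o, dtype=float)  # expanded vector in R^{k*o}

# Non-overlapping windows of size k: view as (o, k), reduce over the window axis
z = z_tilde.reshape(o, k).sum(axis=1)
print(z)  # [ 3. 12. 21. 30.]
```

In PyTorch or TensorFlow the same pattern applies per batch row, with the reduction fused into a single kernel.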

7. Significance and Research Impact

MFB and MFH provide a principled and computationally feasible approach to multimodal fusion, capturing rich cross-modal interactions without incurring the combinatorial explosion of full bilinear pooling. The effectiveness of higher-order pooling is empirically established, with ablations confirming that elementwise product cascades enable richer combinatorial patterns between vision and language representations. These methods have advanced the state of the art in VQA (runner-up, VQA Challenge 2017) and set a reference point for subsequent multimodal fusion research (Yu et al., 2017).

References

  1. Yu, Z., Yu, J., Fan, J., & Tao, D. (2017). Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
