MFB & MFH: Efficient Multimodal Fusion
- MFB is a multimodal fusion approach that factorizes full bilinear pooling into low-rank matrices, enabling efficient capture of second-order feature interactions.
- MFH extends MFB by cascading multiple fusion blocks to extract higher-order feature correlations, leading to measurable gains in VQA accuracy.
- Both methods balance computational efficiency and expressive power, underpinning state-of-the-art VQA architectures with robust normalization and dropout techniques.
Multimodal Factorized Bilinear (MFB) and its extension Multimodal Factorized High-order pooling (MFH) are families of efficient and expressive fusion operators designed for multimodal representation learning, particularly in the context of Visual Question Answering (VQA). They generalize bilinear pooling approaches by decomposing the parameter-intensive full bilinear tensor into low-rank factors, enabling tractable computation and higher-order feature interactions while retaining discriminative power. MFB and MFH have demonstrated significant empirical gains over previous multimodal fusion strategies on large-scale VQA tasks (Yu et al., 2017).
1. Mathematical Foundations
Given a visual feature $x \in \mathbb{R}^m$ and a textual feature $y \in \mathbb{R}^n$, classical bilinear pooling computes, for output dimension $o$,
$$z_i = x^\top W_i y, \qquad i = 1, \dots, o,$$
aggregated across an order-three tensor $W = [W_1, \dots, W_o] \in \mathbb{R}^{m \times n \times o}$, which is prohibitive for large $m$, $n$, and $o$. MFB mitigates this by factorizing each $W_i$ as $U_i V_i^\top$ with $U_i \in \mathbb{R}^{m \times k}$ and $V_i \in \mathbb{R}^{n \times k}$ of rank $k$:
$$z_i = x^\top U_i V_i^\top y = \mathbb{1}^\top \left( U_i^\top x \circ V_i^\top y \right),$$
where $\circ$ denotes elementwise multiplication and $\mathbb{1}$ is the all-ones vector in $\mathbb{R}^k$.
Stacking all factors results in $U = [U_1, \dots, U_o] \in \mathbb{R}^{m \times ko}$, $V = [V_1, \dots, V_o] \in \mathbb{R}^{n \times ko}$, and the expanded fusion vector
$$\tilde{z} = U^\top x \circ V^\top y \in \mathbb{R}^{ko},$$
which is sum-pooled in non-overlapping windows of size $k$ to yield $z \in \mathbb{R}^o$:
$$z = \mathrm{SumPool}(\tilde{z}, k).$$
Normalization—power normalization ($z \leftarrow \mathrm{sign}(z)\sqrt{|z|}$) and $\ell_2$ normalization ($z \leftarrow z / \|z\|_2$)—is typically applied.
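A minimal PyTorch sketch of a single MFB block under these definitions (module and argument names are illustrative, not the reference implementation; the small epsilon guards the square root at zero):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFB(nn.Module):
    """Multimodal Factorized Bilinear pooling: z = SumPool(U^T x * V^T y, k)."""

    def __init__(self, m, n, o=1000, k=5, dropout=0.1):
        super().__init__()
        self.o, self.k = o, k
        self.U = nn.Linear(m, k * o, bias=False)  # stacked factors U in R^{m x ko}
        self.V = nn.Linear(n, k * o, bias=False)  # stacked factors V in R^{n x ko}
        self.drop = nn.Dropout(dropout)

    def forward(self, x, y):
        z_exp = self.drop(self.U(x) * self.V(y))              # expanded vector, (B, k*o)
        z = z_exp.view(-1, self.o, self.k).sum(dim=2)         # sum-pool windows of size k
        z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-12)  # power normalization
        return F.normalize(z, p=2, dim=1)                     # l2 normalization
```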
MFH further increases representational capacity by cascading $p$ MFB “blocks”; at block $i$, the intermediate expanded vector is recursively multiplied element-wise:
$$z_{\mathrm{exp}}^{i} = z_{\mathrm{exp}}^{i-1} \circ \left( U_i^\top x \circ V_i^\top y \right), \qquad z_{\mathrm{exp}}^{0} = \mathbb{1},$$
producing order-$2i$ feature products. Outputs $z^i = \mathrm{SumPool}(z_{\mathrm{exp}}^i, k)$ from each block are normalized, then concatenated to form the final $z = [z^1; z^2; \dots; z^p]$.
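Continuing the sketch, the cascade threads the expanded vector through successive blocks exactly as in the recursion above (imports repeated so the block stands alone; names again illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFH(nn.Module):
    """Cascade of p MFB blocks; block i folds its elementwise product into
    the expanded vector of block i-1, yielding order-2i interactions."""

    def __init__(self, m, n, o=1000, k=5, p=2, dropout=0.1):
        super().__init__()
        self.o, self.k = o, k
        self.U = nn.ModuleList(nn.Linear(m, k * o, bias=False) for _ in range(p))
        self.V = nn.ModuleList(nn.Linear(n, k * o, bias=False) for _ in range(p))
        self.drop = nn.Dropout(dropout)

    def forward(self, x, y):
        outputs, z_exp = [], 1.0                        # z_exp^0 = all-ones (broadcast)
        for U_i, V_i in zip(self.U, self.V):
            z_exp = z_exp * self.drop(U_i(x) * V_i(y))  # recursive elementwise product
            z = z_exp.view(-1, self.o, self.k).sum(dim=2)
            z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-12)
            outputs.append(F.normalize(z, p=2, dim=1))  # power + l2 norm per block
        return torch.cat(outputs, dim=1)                # (B, p*o), e.g. 2000-D for p=2
```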
2. Relationship to MLB, MCB, and Expressivity
MFB and MFH are part of a continuum of multimodal fusion schemes:
- Bilinear Pooling: $m \times n \times o$ parameters; not practical for high-dimensional features.
- MCB: Uses random projections (count sketch) combined via circular convolution to approximate the outer product in a high-dimensional space [performance: 59.8% (16K-D output)].
- MLB: Sets $k = 1$ with no pooling, i.e. $z = \tanh(U^\top x \circ V^\top y)$—a rank-1 special case with limited expressivity [performance: 59.7% (1K-D)].
- MFB: Equivalent to MFH with $p = 1$; uses $k = 5$ and sum-pooling for second-order correlations, with $k(m + n)o$ parameters [60.9% (1K-D)].
- MFH: Cascades MFB blocks ($p \geq 2$) for higher-order feature products, with output size growing linearly in $p$ [MFH ($p = 2$): 61.6%; MFH ($p = 3$): 61.5%].
Empirical results show progressive gains with higher-order pooling: MFH ($p = 2$) improves on MFB by +0.7%, and performance saturates at $p = 3$; the back-of-the-envelope comparison below makes the efficiency gap concrete.
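Using the feature sizes that appear later in this article ($m = 2048$, $n = 1024$, $o = 1000$, $k = 5$):

```python
# Parameter counts for the fusion layer alone, using the paper's feature sizes.
m, n, o, k = 2048, 1024, 1000, 5

full_bilinear = m * n * o   # full tensor W: 2,097,152,000 parameters
mfb = k * (m + n) * o       # factored U, V:    15,360,000 parameters

print(f"full bilinear: {full_bilinear:,}")
print(f"MFB (k={k}):   {mfb:,}")
print(f"reduction:     {full_bilinear / mfb:.0f}x")  # ~137x fewer parameters
```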
3. Integration into VQA Architectures
MFB and MFH underpin several VQA architectures. The baseline no-attention pipeline extracts a 2048-D image vector $x$ from ResNet-152 and a 1024-D question vector $y$ from an LSTM; fusion is performed via MFB or MFH, followed by a softmax classifier over 3000 answer classes (Yu et al., 2017).
A co-attention variant employs grid-level image features ($X \in \mathbb{R}^{2048 \times 196}$, from the $14 \times 14$ ResNet feature map) and word-level question features ($Y \in \mathbb{R}^{1024 \times T}$ for a $T$-word question). Question self-attention yields an attended question feature $\hat{y}$; image attention guided by $\hat{y}$ yields an attended image feature $\hat{x}$; and fusion proceeds via MFH. Attention maps and fusion blocks may be parallelized (“glimpses”).
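A wiring sketch of the no-attention baseline head (the class name, the fused_dim argument, and the assumption that the fusion module maps the two vectors to a 2000-D output are ours, not the paper's):

```python
import torch
import torch.nn as nn

class VQABaseline(nn.Module):
    """No-attention pipeline: fuse a ResNet-152 image vector with an LSTM
    question vector, then score the candidate answers."""

    def __init__(self, fusion, fused_dim=2000, num_answers=3000):
        super().__init__()
        self.fusion = fusion                                 # e.g. the MFH sketch above
        self.classifier = nn.Linear(fused_dim, num_answers)  # softmax applied in the loss

    def forward(self, img_feat, q_feat):
        # img_feat: (B, 2048) pooled ResNet-152 features
        # q_feat:   (B, 1024) final LSTM hidden state
        return self.classifier(self.fusion(img_feat, q_feat))  # answer logits

# Shape check, assuming the MFH sketch from Section 1 is in scope:
# logits = VQABaseline(MFH(2048, 1024))(torch.randn(4, 2048), torch.randn(4, 1024))
# logits.shape == (4, 3000)
```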
4. Training Methods and Regularization
Training employs a KL-divergence loss between the predicted and ground-truth answer distributions:
$$\mathcal{L}_{\mathrm{KLD}} = \sum_{i=1}^{N} a_i \log \frac{a_i}{p_i},$$
where $a$ is the soft ground-truth distribution over the $N$ candidate answers and $p$ is the predicted softmax distribution. Optimization uses Adam (initial learning rate 0.0007), batch size 200 (no attention) or 64 (attention), dropout after the LSTM (ratio 0.3) and after each MFB expansion (ratio 0.1), and feature normalization. Typical hyperparameters: $k = 5$ and $o = 1000$ for each MFB block, $p = 2$ blocks for MFH (output dimension 2000), and 3000 answer classes.
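A compact sketch of this objective (the helper name is ours; F.kl_div is the standard PyTorch call and expects log-probabilities for the prediction):

```python
import torch.nn.functional as F

def kld_loss(logits, answer_dist):
    """KL(a || p) between the soft ground-truth answer distribution a
    (e.g. normalized annotator answer counts) and the predicted softmax p."""
    log_p = F.log_softmax(logits, dim=1)
    # 'batchmean' divides the summed divergence by the batch size.
    return F.kl_div(log_p, answer_dist, reduction="batchmean")
```

In recent PyTorch versions, zero-probability entries of the target contribute zero loss, matching the convention $0 \log 0 = 0$.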
Word embeddings are learned and optionally concatenated with 300-D pretrained GloVe vectors.
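A hypothetical sketch of that concatenation (class and argument names are ours):

```python
import torch
import torch.nn as nn

class ConcatEmbedding(nn.Module):
    """Learned word embedding concatenated with frozen pretrained GloVe vectors."""

    def __init__(self, vocab_size, dim, glove_weights):
        super().__init__()
        self.learned = nn.Embedding(vocab_size, dim)
        # glove_weights: (vocab_size, 300) tensor of pretrained GloVe vectors
        self.glove = nn.Embedding.from_pretrained(glove_weights, freeze=True)

    def forward(self, tokens):  # tokens: (B, T) word indices
        return torch.cat([self.learned(tokens), self.glove(tokens)], dim=-1)
```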
5. Empirical Results and Ablation Studies
Empirical evaluation on VQA-1.0 open-ended test-dev with single models:
- MFB ($p = 1$): 60.9% accuracy.
- MFH ($p = 2$): 61.6% (+0.7 over MFB).
- MFH ($p = 3$): 61.5% (performance saturates).
- Attention variants: MFB+Att 64.6%, MFH+Att 65.3%.
- Co-attention: MFH+CoAtt 65.8%; +GloVe 66.8%; +VG data 67.7%.
- Ensemble (7 models) MFH+CoAtt+GloVe: 69.2% (test-std), outperforming ensemble MLB (66.9%), MCB (66.5%).
- VQA-2.0 test-dev: single MFH+CoAtt+GloVe: 65.80%; ensemble (9 models): 68.02% (2nd place in the VQA Challenge 2017).
Ablation studies highlight the importance of normalization (without it, MFB's accuracy drops by roughly 3 points), the superiority of the KL loss over answer sampling (+0.2–0.3%, with faster convergence), and the necessity of higher-order pooling (+0.7% from $p = 1$ to $p = 2$).
6. Implementation and Computational Characteristics
MFH is efficiently realizable within modern deep learning frameworks. The pseudocode in (Yu et al., 2017) prescribes, for each MFH block, two linear projections (one each for $x$ and $y$), elementwise multiplication, dropout, recursive elementwise multiplication for higher-order interactions, sum-pooling, and normalization (the same steps as the sketches in Section 1). The output vector grows linearly with the number of blocks $p$; the parameter overhead is $k(m + n)o$ per block.
In practice, pooling, dropout, and normalization are implemented as grouped operations and pointwise transformations, ensuring compatibility with PyTorch or TensorFlow. This architecture enables scalable, expressive, and tractable multimodal fusion applicable to large-scale VQA and related multimodal tasks.
7. Significance and Research Impact
MFB and MFH provide a principled and computationally feasible approach to multimodal fusion, capturing rich cross-modal interactions without incurring the combinatorial explosion of full bilinear pooling. The effectiveness of higher-order pooling is empirically established, with ablations confirming that elementwise product cascades enable richer combinatorial patterns between vision and language representations. These methods have advanced the state of the art in VQA (runner-up, VQA Challenge 2017) and set a reference point for subsequent multimodal fusion research (Yu et al., 2017).