MFB & MFH: Efficient Multimodal Fusion
- MFB is a multimodal fusion approach that factorizes full bilinear pooling into low-rank matrices, enabling efficient capture of second-order feature interactions.
- MFH extends MFB by cascading multiple fusion blocks to extract higher-order feature correlations, leading to measurable gains in VQA accuracy.
- Both methods balance computational efficiency and expressive power, underpinning state-of-the-art VQA architectures with robust normalization and dropout techniques.
Multimodal Factorized Bilinear (MFB) and its extension Multimodal Factorized High-order pooling (MFH) are families of efficient and expressive fusion operators designed for multimodal representation learning, particularly in the context of Visual Question Answering (VQA). They generalize bilinear pooling approaches by decomposing the parameter-intensive full bilinear tensor into low-rank factors, enabling tractable computation and higher-order feature interactions while retaining discriminative power. MFB and MFH have demonstrated significant empirical gains over previous multimodal fusion strategies on large-scale VQA tasks (Yu et al., 2017).
1. Mathematical Foundations
Given a visual feature $x \in \mathbb{R}^m$ and a textual feature $y \in \mathbb{R}^n$, classical bilinear pooling computes, for output dimension $o$,
$$z_i = x^\top W_i y, \qquad i = 1, \dots, o,$$
aggregated across an order-three tensor $W = [W_1, \dots, W_o] \in \mathbb{R}^{m \times n \times o}$, which is prohibitive for large $m$, $n$, and $o$. MFB mitigates this by factorizing each $W_i$ as $U_i V_i^\top$ with $U_i \in \mathbb{R}^{m \times k}$ and $V_i \in \mathbb{R}^{n \times k}$ of rank $k$:
$$z_i = x^\top U_i V_i^\top y = \mathbb{1}^\top \left( U_i^\top x \circ V_i^\top y \right),$$
where $\circ$ denotes elementwise multiplication and $\mathbb{1}$ is the all-ones vector in $\mathbb{R}^k$.
Stacking all factors results in $U = [U_1, \dots, U_o] \in \mathbb{R}^{m \times ko}$, $V = [V_1, \dots, V_o] \in \mathbb{R}^{n \times ko}$, and the expanded fusion vector
$$\tilde{z} = U^\top x \circ V^\top y \in \mathbb{R}^{ko},$$
which is sum-pooled in non-overlapping windows of size $k$ to yield $z \in \mathbb{R}^o$:
$$z = \mathrm{SumPool}(\tilde{z}, k).$$
Normalization—power normalization ($z \leftarrow \mathrm{sign}(z)\sqrt{|z|}$) and $\ell_2$ normalization ($z \leftarrow z / \|z\|_2$)—is typically applied.
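A minimal PyTorch sketch of a single MFB block under these definitions (module and argument names are illustrative, not the reference implementation; the small epsilon guards the square root at zero):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFB(nn.Module):
    """Multimodal Factorized Bilinear pooling: z = SumPool(U^T x * V^T y, k)."""

    def __init__(self, m, n, o=1000, k=5, dropout=0.1):
        super().__init__()
        self.o, self.k = o, k
        self.U = nn.Linear(m, k * o, bias=False)  # stacked factors U in R^{m x ko}
        self.V = nn.Linear(n, k * o, bias=False)  # stacked factors V in R^{n x ko}
        self.drop = nn.Dropout(dropout)

    def forward(self, x, y):
        z_exp = self.drop(self.U(x) * self.V(y))              # expanded vector, (B, k*o)
        z = z_exp.view(-1, self.o, self.k).sum(dim=2)         # sum-pool windows of size k
        z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-12)  # power normalization
        return F.normalize(z, p=2, dim=1)                     # l2 normalization
```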
MFH further increases representational capacity by cascading $p$ MFB “blocks”; at block $i$, the intermediate expanded vector is recursively multiplied element-wise:
$$z_{\mathrm{exp}}^{i} = z_{\mathrm{exp}}^{i-1} \circ \left( U_i^\top x \circ V_i^\top y \right), \qquad z_{\mathrm{exp}}^{0} = \mathbb{1},$$
producing order-$2i$ feature products. Outputs $z^i = \mathrm{SumPool}(z_{\mathrm{exp}}^i, k)$ from each block are normalized, then concatenated to form the final $z = [z^1; z^2; \dots; z^p]$.
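Continuing the sketch, the cascade threads the expanded vector through successive blocks exactly as in the recursion above (imports repeated so the block stands alone; names again illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFH(nn.Module):
    """Cascade of p MFB blocks; block i folds its elementwise product into
    the expanded vector of block i-1, yielding order-2i interactions."""

    def __init__(self, m, n, o=1000, k=5, p=2, dropout=0.1):
        super().__init__()
        self.o, self.k = o, k
        self.U = nn.ModuleList(nn.Linear(m, k * o, bias=False) for _ in range(p))
        self.V = nn.ModuleList(nn.Linear(n, k * o, bias=False) for _ in range(p))
        self.drop = nn.Dropout(dropout)

    def forward(self, x, y):
        outputs, z_exp = [], 1.0                        # z_exp^0 = all-ones (broadcast)
        for U_i, V_i in zip(self.U, self.V):
            z_exp = z_exp * self.drop(U_i(x) * V_i(y))  # recursive elementwise product
            z = z_exp.view(-1, self.o, self.k).sum(dim=2)
            z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-12)
            outputs.append(F.normalize(z, p=2, dim=1))  # power + l2 norm per block
        return torch.cat(outputs, dim=1)                # (B, p*o), e.g. 2000-D for p=2
```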
2. Relationship to MLB, MCB, and Expressivity
MFB and MFH are part of a continuum of multimodal fusion schemes:
- Bilinear Pooling: $m \times n \times o$ parameters; not practical for high-dimensional features.
- MCB: Uses random projections (count sketch) combined via circular convolution to approximate the outer product in a high-dimensional space [performance: 59.8% (16K-D output)].
- MLB: Sets $k = 1$ with no pooling, i.e. $z = \tanh(U^\top x \circ V^\top y)$—a rank-1 special case with limited expressivity [performance: 59.7% (1K-D)].
- MFB: Equivalent to MFH with $p = 1$; uses $k = 5$ and sum-pooling for second-order correlations, with $k(m + n)o$ parameters [60.9% (1K-D)].
- MFH: Cascades MFB blocks ($p \geq 2$) for higher-order feature products, with output size growing linearly in $p$ [MFH ($p = 2$): 61.6%; MFH ($p = 3$): 61.5%].
Empirical results show progressive gains with higher-order pooling: MFH ($p = 2$) improves on MFB by +0.7%, and performance saturates at $p = 3$; the back-of-the-envelope comparison below makes the efficiency gap concrete.
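Using the feature sizes that appear later in this article ($m = 2048$, $n = 1024$, $o = 1000$, $k = 5$):

```python
# Parameter counts for the fusion layer alone, using the paper's feature sizes.
m, n, o, k = 2048, 1024, 1000, 5

full_bilinear = m * n * o   # full tensor W: 2,097,152,000 parameters
mfb = k * (m + n) * o       # factored U, V:    15,360,000 parameters

print(f"full bilinear: {full_bilinear:,}")
print(f"MFB (k={k}):   {mfb:,}")
print(f"reduction:     {full_bilinear / mfb:.0f}x")  # ~137x fewer parameters
```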
3. Integration into VQA Architectures
MFB and MFH underpin several VQA architectures. The baseline no-attention pipeline extracts a 2048-D image vector $x$ from ResNet-152 and a 1024-D question vector $y$ from an LSTM; fusion is performed via MFB or MFH, followed by a softmax classifier over 3000 answer classes (Yu et al., 2017).
A co-attention variant employs grid-level image features ($X \in \mathbb{R}^{2048 \times 196}$, from the $14 \times 14$ ResNet feature map) and word-level question features ($Y \in \mathbb{R}^{1024 \times T}$ for a $T$-word question). Question self-attention yields an attended question feature $\hat{y}$; image attention guided by $\hat{y}$ yields an attended image feature $\hat{x}$; and fusion proceeds via MFH. Attention maps and fusion blocks may be parallelized (“glimpses”).
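A wiring sketch of the no-attention baseline head (the class name, the fused_dim argument, and the assumption that the fusion module maps the two vectors to a 2000-D output are ours, not the paper's):

```python
import torch
import torch.nn as nn

class VQABaseline(nn.Module):
    """No-attention pipeline: fuse a ResNet-152 image vector with an LSTM
    question vector, then score the candidate answers."""

    def __init__(self, fusion, fused_dim=2000, num_answers=3000):
        super().__init__()
        self.fusion = fusion                                 # e.g. the MFH sketch above
        self.classifier = nn.Linear(fused_dim, num_answers)  # softmax applied in the loss

    def forward(self, img_feat, q_feat):
        # img_feat: (B, 2048) pooled ResNet-152 features
        # q_feat:   (B, 1024) final LSTM hidden state
        return self.classifier(self.fusion(img_feat, q_feat))  # answer logits

# Shape check, assuming the MFH sketch from Section 1 is in scope:
# logits = VQABaseline(MFH(2048, 1024))(torch.randn(4, 2048), torch.randn(4, 1024))
# logits.shape == (4, 3000)
```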
4. Training Methods and Regularization
Training employs a KL-divergence loss between the predicted and ground-truth answer distributions:
$$\mathcal{L}_{\mathrm{KLD}} = \sum_{i=1}^{N} a_i \log \frac{a_i}{p_i},$$
where $a$ is the soft ground-truth distribution over the $N$ candidate answers and $p$ is the predicted softmax distribution. Optimization uses Adam (initial learning rate 0.0007), batch size 200 (no attention) or 64 (attention), dropout after the LSTM (ratio 0.3) and after each MFB expansion (ratio 0.1), and feature normalization. Typical hyperparameters: $k = 5$ and $o = 1000$ for each MFB block, $p = 2$ blocks for MFH (output dimension 2000), and 3000 answer classes.
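A compact sketch of this objective (the helper name is ours; F.kl_div is the standard PyTorch call and expects log-probabilities for the prediction):

```python
import torch.nn.functional as F

def kld_loss(logits, answer_dist):
    """KL(a || p) between the soft ground-truth answer distribution a
    (e.g. normalized annotator answer counts) and the predicted softmax p."""
    log_p = F.log_softmax(logits, dim=1)
    # 'batchmean' divides the summed divergence by the batch size.
    return F.kl_div(log_p, answer_dist, reduction="batchmean")
```

In recent PyTorch versions, zero-probability entries of the target contribute zero loss, matching the convention $0 \log 0 = 0$.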
Word embeddings are learned and optionally concatenated with 300-D pretrained GloVe vectors.
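A hypothetical sketch of that concatenation (class and argument names are ours):

```python
import torch
import torch.nn as nn

class ConcatEmbedding(nn.Module):
    """Learned word embedding concatenated with frozen pretrained GloVe vectors."""

    def __init__(self, vocab_size, dim, glove_weights):
        super().__init__()
        self.learned = nn.Embedding(vocab_size, dim)
        # glove_weights: (vocab_size, 300) tensor of pretrained GloVe vectors
        self.glove = nn.Embedding.from_pretrained(glove_weights, freeze=True)

    def forward(self, tokens):  # tokens: (B, T) word indices
        return torch.cat([self.learned(tokens), self.glove(tokens)], dim=-1)
```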
5. Empirical Results and Ablation Studies
Empirical evaluation on VQA-1.0 open-ended test-dev with single models:
- MFB ($p = 1$): 60.9% accuracy.
- MFH ($p = 2$): 61.6% (+0.7 over MFB).
- MFH ($p = 3$): 61.5% (performance saturates).
- Attention variants: MFB+Att 64.6%, MFH+Att 65.3%.
- Co-attention: MFH+CoAtt 65.8%; +GloVe 66.8%; +VG data 67.7%.
- Ensemble (7 models) MFH+CoAtt+GloVe: 69.2% (test-std), outperforming ensemble MLB (66.9%), MCB (66.5%).
- VQA-2.0 test-dev: single MFH+CoAtt+GloVe: 65.80%; ensemble (9 models): 68.02% (2nd place in the VQA Challenge 2017).
Ablation studies highlight the importance of normalization (without it, MFB's accuracy drops by roughly 3 points), the superiority of the KL loss over answer sampling (+0.2–0.3%, with faster convergence), and the necessity of higher-order pooling (+0.7% from $p = 1$ to $p = 2$).
6. Implementation and Computational Characteristics
MFH is efficiently realizable within modern deep learning frameworks. The pseudocode in (Yu et al., 2017) prescribes, for each MFH block, two linear projections (one each for $x$ and $y$), elementwise multiplication, dropout, recursive elementwise multiplication for higher-order interactions, sum-pooling, and normalization (the same steps as the sketches in Section 1). The output vector grows linearly with the number of blocks $p$; the parameter overhead is $k(m + n)o$ per block.
In practice, pooling, dropout, and normalization are implemented as grouped operations and pointwise transformations, ensuring compatibility with PyTorch or TensorFlow. This architecture enables scalable, expressive, and tractable multimodal fusion applicable to large-scale VQA and related multimodal tasks.
7. Significance and Research Impact
MFB and MFH provide a principled and computationally feasible approach to multimodal fusion, capturing rich cross-modal interactions without incurring the combinatorial explosion of full bilinear pooling. The effectiveness of higher-order pooling is empirically established, with ablations confirming that elementwise product cascades enable richer combinatorial patterns between vision and language representations. These methods have advanced the state of the art in VQA (runner-up, VQA Challenge 2017) and set a reference point for subsequent multimodal fusion research (Yu et al., 2017).