
Multimodal Factorized High-Order Pooling

Updated 23 April 2026
  • The paper introduces MFH, which extends MFB pooling by cascading factorized blocks to capture (p+1)-th order interactions between modalities.
  • MFH leverages low-rank approximations, dropout, and normalization techniques to achieve expressive multimodal representations while maintaining computational efficiency.
  • MFH is integrated within VQA pipelines using co-attention mechanisms, leading to state-of-the-art accuracy improvements on major benchmarks.

Multimodal Factorized High-order Pooling (MFH) is a deep learning fusion strategy designed to capture rich and complex interactions among features from distinct modalities, with primary applications in visual question answering (VQA). MFH generalizes standard bilinear pooling by factorizing the weight tensors of bilinear fusion into compact low-rank blocks and cascading these factorized modules to achieve higher-order feature interactions. This approach enables more expressive multimodal representations while maintaining manageable parameterization and computational efficiency (Yu et al., 2017).

1. Mathematical Formulation of MFH

Given an image feature vector $x \in \mathbb{R}^m$ and a question feature vector $y \in \mathbb{R}^n$, second-order (bilinear) pooling seeks to capture all multiplicative interactions between $x$ and $y$ via

$$z_i = x^\top W_i y, \qquad W_i \in \mathbb{R}^{m \times n}, \quad i = 1, \ldots, o,$$

producing an output $z \in \mathbb{R}^o$. Direct parametrization is intractable for large $m$, $n$, and $o$, since the full weight tensor requires $m \times n \times o$ parameters. Multimodal Factorized Bilinear (MFB) pooling addresses this by approximating each $W_i$ with two low-rank matrices $U_i \in \mathbb{R}^{m \times k}$ and $V_i \in \mathbb{R}^{n \times k}$ (rank $k \ll \min(m, n)$):

$$z_i = \sum_{d=1}^{k} \big(U_i^{(d)\top} x\big)\big(V_i^{(d)\top} y\big) = \mathbf{1}^\top \big(U_i^\top x \circ V_i^\top y\big),$$

where $U_i^{(d)}$/$V_i^{(d)}$ denote columns of $U_i$/$V_i$ respectively and $\circ$ is the Hadamard (element-wise) product. For all $o$ outputs, the $U_i$ and $V_i$ are concatenated into $\tilde U \in \mathbb{R}^{m \times ko}$ and $\tilde V \in \mathbb{R}^{n \times ko}$, yielding the expanded vector $\tilde U^\top x \circ \tilde V^\top y \in \mathbb{R}^{ko}$, followed by block-wise SumPool with window size $k$ to produce $z \in \mathbb{R}^o$.
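The low-rank identity above is easy to verify numerically. A minimal NumPy sketch, with toy dimensions chosen purely for illustration (not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, o = 8, 6, 5, 4  # toy sizes: image dim, question dim, rank, number of outputs

x = rng.normal(size=m)
y = rng.normal(size=n)
U = rng.normal(size=(o, m, k))  # one U_i in R^{m x k} per output slice
V = rng.normal(size=(o, n, k))  # one V_i in R^{n x k} per output slice

# Factorized form: z_i = 1^T (U_i^T x  o  V_i^T y)
z_factorized = np.array([np.sum((U[i].T @ x) * (V[i].T @ y)) for i in range(o)])

# Full bilinear form using the low-rank reconstruction W_i = U_i V_i^T
z_bilinear = np.array([x @ (U[i] @ V[i].T) @ y for i in range(o)])

assert np.allclose(z_factorized, z_bilinear)  # identical up to float error
```

The factorized form needs only $k(m+n)$ parameters per output slice instead of $mn$, which is the source of MFB's efficiency.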

MFH extends this by cascading $p$ such MFB blocks, enabling the model to encode up to $(p+1)$-th order interactions between $x$ and $y$. For block $i$, the expanded feature vector is:

  • $z_{\text{exp}}^{i} = \operatorname{Dropout}\big(\tilde U_i^\top x \circ \tilde V_i^\top y\big) \circ z_{\text{exp}}^{i-1}$
  • $z_{\text{exp}}^{0} = \mathbf{1}$ (an all-ones vector), so the first block reduces to standard MFB
  • $z^{i} = \operatorname{SumPool}\big(z_{\text{exp}}^{i},\, k\big)$, followed by power and $\ell_2$ normalization

The outputs $z^{i}$ of all $p$ blocks are concatenated, yielding the final MFH representation $z = \big[z^{1}; \ldots; z^{p}\big] \in \mathbb{R}^{po}$.

2. Algorithmic Workflow and Pseudocode

The MFH pooling sequence comprises repeated application of three stages:

  1. Expand: Linear projections of $x$ and $y$ via $\tilde U_i$ and $\tilde V_i$, followed by Hadamard product and dropout.
  2. Cascade: Element-wise multiplication with the previous block’s expansion output.
  3. Squeeze and Normalize: Block-wise sum pooling, followed by element-wise signed square root (power normalization) and $\ell_2$ normalization.

Both normalization steps, the signed square root and the $\ell_2$ normalization, are applied per block and are essential to prevent instability and excessive neuron magnitudes.
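The three stages can be sketched as a NumPy forward pass. Everything here is an illustrative assumption: `mfh_forward` is a hypothetical name, random matrices stand in for learned projections, and the reshape assumes the $ko$-dimensional expansion groups $k$ consecutive factors per output.

```python
import numpy as np

def mfh_forward(x, y, U_list, V_list, k, drop_rate=0.1, rng=None, train=False):
    """MFH^p pooling: cascade of p factorized (MFB) blocks.

    U_list[i]: (m, k*o) projection for block i; V_list[i]: (n, k*o).
    Returns the concatenated, normalized outputs of all p blocks, shape (p*o,).
    """
    rng = rng or np.random.default_rng(0)
    z_exp_prev = 1.0                       # z_exp^0 = all-ones (scalar broadcast)
    outputs = []
    for U, V in zip(U_list, V_list):
        z_exp = (U.T @ x) * (V.T @ y)      # Expand: Hadamard of projections
        if train:                          # inverted dropout on the expansion
            mask = rng.random(z_exp.shape) >= drop_rate
            z_exp = z_exp * mask / (1.0 - drop_rate)
        z_exp = z_exp * z_exp_prev         # Cascade with previous block's expansion
        z_exp_prev = z_exp
        z = z_exp.reshape(-1, k).sum(axis=1)   # Squeeze: SumPool over window k
        z = np.sign(z) * np.sqrt(np.abs(z))    # power (signed sqrt) normalization
        z = z / (np.linalg.norm(z) + 1e-12)    # l2 normalization
        outputs.append(z)
    return np.concatenate(outputs)

# usage with toy dimensions
rng = np.random.default_rng(1)
m, n, k, o, p = 16, 12, 5, 8, 2
x, y = rng.normal(size=m), rng.normal(size=n)
Us = [rng.normal(size=(m, k * o)) for _ in range(p)]
Vs = [rng.normal(size=(n, k * o)) for _ in range(p)]
z = mfh_forward(x, y, Us, Vs, k)
assert z.shape == (p * o,)
```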

3. Hyperparameter Selection and Design Considerations

MFH’s expressivity and efficiency depend critically on three main hyperparameters:

  • Factor dimension $k$: Controls the low-rank approximation per bilinear slice. Typical range is $k \in [1, 10]$, with $k = 5$ empirically effective on VQA benchmarks.
  • Output subdimension $o$: Determines each block's pooled vector size; the total MFH output has dimension $po$. Recommended values of $o$ are in the range 500–1000 (default $o = 1000$).
  • Number of blocks $p$: Higher $p$ enables modeling higher-order interactions. Experiments show $p = 2$ (MFH$^2$) achieves a strong balance, with diminishing benefit beyond $p = 3$.

A practical guideline is to constrain the intermediate expansion dimension $ko$ to several thousand. Dropout and normalization must be applied at each stage for stable training.
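To make the scaling concrete, a quick parameter count under assumed feature dimensions ($m = 2048$ image, $n = 1024$ question; these specific sizes are illustrative, not prescribed by the source):

```python
# Parameters of one MFB block: U~ (m x ko) + V~ (n x ko); MFH^p stacks p blocks.
m, n = 2048, 1024        # assumed image / question feature dims
k, o, p = 5, 1000, 2     # defaults suggested in the text

per_block = m * k * o + n * k * o
mfh_params = p * per_block
naive_bilinear = m * n * (p * o)   # unfactorized W_i tensors for the same output size

print(mfh_params)        # -> 30720000  (~30.7M)
print(naive_bilinear)    # -> 4194304000  (~4.2B, over 100x larger)
```

The two-orders-of-magnitude gap is why the factorized parametrization is what makes high-order pooling trainable at these dimensions.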

4. Integration within VQA Architectures

In advanced VQA networks, MFH is integrated into a three-stage architecture:

  • Feature Extraction:
    • Image: CNN (e.g., ResNet-152) yields either a global 2048-D vector or a spatial map of region features.
    • Question: Tokenized words are embedded and passed through an LSTM; self-attention over output sequences produces attentive question features.
  • Co-Attention Module:
    • Question self-attention: Attentive reduction of LSTM outputs.
    • Image attention: Each spatial image region feature is fused with the question representation via a lightweight MFB block, followed by softmax attention weighting.
  • Final Fusion and Answer Prediction:

    • The attended image and question features are fused with the main MFH (or MFB when $p = 1$) module to give the joint representation $z$.
    • A fully connected layer projects $z$ to logits over the answer vocabulary.
    • Loss is computed using Kullback–Leibler divergence between normalized answer histograms $\hat a$ and predicted distributions $a$:

    $\mathcal{L}_{\mathrm{KLD}} = \mathrm{KL}\big(\hat a \,\|\, a\big) = \sum_{j} \hat a_j \log \dfrac{\hat a_j}{a_j}$

    This approach leverages the multi-annotation nature of VQA labeling and achieves faster convergence and slightly higher accuracy than single-label cross-entropy or randomized answer sampling.
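The KLD objective can be sketched as follows; the `kld_loss` helper, the annotator counts, and the four-answer vocabulary are all illustrative assumptions:

```python
import numpy as np

def kld_loss(a_hat, logits, eps=1e-12):
    """KL(a_hat || softmax(logits)): ground-truth answer histogram vs. prediction."""
    a = np.exp(logits - logits.max())      # numerically stable softmax
    a = a / a.sum()
    mask = a_hat > 0                       # terms with a_hat_j = 0 contribute 0
    return np.sum(a_hat[mask] * np.log(a_hat[mask] / (a[mask] + eps)))

# e.g. 10 annotators over a 4-answer vocabulary: 7 chose answer 0, 3 chose answer 2
a_hat = np.array([7, 0, 3, 0]) / 10.0      # normalized answer histogram
logits = np.array([2.0, 0.1, 1.0, 0.1])    # model outputs before softmax
loss = kld_loss(a_hat, logits)
assert loss >= 0.0                         # KL divergence is always non-negative
```

Unlike single-label cross-entropy, the target here is a distribution over answers, which directly exploits the ten-annotator labeling of VQA datasets.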

5. Empirical Evaluation and Comparative Analysis

MFH has been evaluated across major VQA benchmarks:

  • VQA-1.0 (test-dev, Open-Ended "All" accuracy):

| Fusion Method | Accuracy (%) |
|-------------------------------|--------------|
| Concat/Sum/Prod | 57–58 |
| Multimodal Compact Bilinear | 59.8 |
| Multimodal Low-Rank Bilinear | 59.7 |
| MFB ($p = 1$) | 60.9 |
| MFH$^2$ ($p = 2$) | 61.6 |
| MFH$^3$ ($p = 3$) | 61.5 |
| MFB+CoAtt | 64.6 |
| MFH+CoAtt | 65.8 |
| MFH+CoAtt+GloVe+VG | 67.7 |
| 7x MFH Ensemble | 69.2 |

  • VQA-2.0 (test-dev):

| Fusion Method | Accuracy (%) |
|-------------------------------|--------------|
| MFB+CoAtt+GloVe | 64.98 |
| MFH+CoAtt+GloVe | 65.80 |
| 9x MFH Ensemble | 68.02 |

These results demonstrate consistent improvement of MFH over first-order (concatenation, sum, product), Multimodal Compact Bilinear, and Multimodal Low-Rank Bilinear pooling baselines. Co-attention integration further boosts performance. MFH achieved new state-of-the-art performance on VQA-1.0 and VQA-2.0 and was runner-up in VQA Challenge 2017 (Yu et al., 2017).

Ablation studies show that both power and $\ell_2$ normalization are crucial; omitting them leads to unstable training and 2–3% accuracy drops. The answer-distribution KLD loss also yields materially faster convergence.

6. Context, Extensions, and Significance

MFH situates itself as a generalization of bilinear pooling, specifically through low-rank factorization and block-wise cascading. When $p = 1$ and $k = 1$, MFH reduces to MLB (Multimodal Low-rank Bilinear) pooling; for $p = 1$ with $k > 1$, MFH is equivalent to MFB. The cascaded construction enables higher-order feature interactions without the exponential parameter cost of naïve outer-product expansion.

The underlying architectural choices—including co-attention, normalization, and KLD-based label smoothing—have demonstrably positive effects on model convergence and generalization. MFH's parameterization is modular, permitting adaptation to different fusion depths, dimensions, and downstream tasks involving multimodal inputs.

A plausible implication is that further extension of MFH to alternative modalities (e.g., speech-text, video-text) or higher-order reasoning tasks could benefit from the same architectural principles, as long as careful regulation of parameter count and normalization is maintained.

7. Summary of Key Properties and Findings

MFH achieves effective multimodal feature fusion by:

  • Employing low-rank factorized bilinear pooling within each block to manage computational complexity.
  • Cascading multiple such blocks to capture high-order (up to $(p+1)$-th) interactions, providing richer cross-modal representations.
  • Integrating fully within state-of-the-art VQA pipelines, leveraging co-attention mechanisms for both vision and language.
  • Demonstrating empirically validated improvements over previous bilinear and low-rank models.
  • Requiring, for stable and accurate training, the use of power and $\ell_2$ normalization at each pooling stage and Kullback–Leibler loss for answer prediction.

These properties substantiate MFH’s role as an effective, scalable, and theoretically motivated multimodal fusion technique in challenging real-world tasks such as VQA (Yu et al., 2017).
