Multimodal Factorized High-Order Pooling
- The paper introduces MFH, which extends MFB pooling by cascading factorized blocks to capture (p+1)-th order interactions between modalities.
- MFH leverages low-rank approximations, dropout, and normalization techniques to achieve expressive multimodal representations while maintaining computational efficiency.
- MFH is integrated within VQA pipelines using co-attention mechanisms, leading to state-of-the-art accuracy improvements on major benchmarks.
Multimodal Factorized High-order Pooling (MFH) is a deep learning fusion strategy designed to capture rich and complex interactions among features from distinct modalities, with primary applications in visual question answering (VQA). MFH generalizes standard bilinear pooling by factorizing the weight tensors of bilinear fusion into compact low-rank blocks and cascading these factorized modules to achieve higher-order feature interactions. This approach enables more expressive multimodal representations while maintaining manageable parameterization and computational efficiency (Yu et al., 2017).
1. Mathematical Formulation of MFH
Given an image feature vector $x \in \mathbb{R}^m$ and a question feature vector $y \in \mathbb{R}^n$, second-order (bilinear) pooling seeks to capture all multiplicative interactions between $x$ and $y$ via

$$z_i = x^\top W_i \, y,$$

producing an output $z \in \mathbb{R}^o$ with $W_i \in \mathbb{R}^{m \times n}$. Direct parametrization is intractable for large $m$, $n$, and $o$ due to the cubic $O(mno)$ scaling. Multimodal Factorized Bilinear (MFB) pooling addresses this by approximating each $W_i$ with two low-rank matrices $U_i \in \mathbb{R}^{m \times k}$ and $V_i \in \mathbb{R}^{n \times k}$ (rank $k$):

$$z_i = \sum_{d=1}^{k} x^\top u_d \, v_d^\top y = \mathbf{1}^\top \left( U_i^\top x \circ V_i^\top y \right),$$

where $u_d$/$v_d$ denote columns of $U_i$/$V_i$ respectively, $\circ$ is the Hadamard (element-wise) product, and $\mathbf{1} \in \mathbb{R}^k$ is an all-ones vector. For all $o$ outputs, the $U_i$ and $V_i$ are reshaped to $\tilde{U} \in \mathbb{R}^{m \times ko}$ and $\tilde{V} \in \mathbb{R}^{n \times ko}$, yielding the expanded vector $z_{\mathrm{exp}} = \tilde{U}^\top x \circ \tilde{V}^\top y \in \mathbb{R}^{ko}$, followed by block-wise SumPool with window size $k$ to produce $z \in \mathbb{R}^o$.
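This low-rank identity is easy to verify numerically. The following is a minimal NumPy check (dimensions are illustrative) confirming that the factorized form equals the full bilinear form with $W = U V^\top$:

```python
import numpy as np

# Verify: 1^T (U^T x ∘ V^T y) == x^T (U V^T) y  for small random inputs.
m, n, k = 4, 3, 5
rng = np.random.default_rng(0)
x, y = rng.standard_normal(m), rng.standard_normal(n)
U, V = rng.standard_normal((m, k)), rng.standard_normal((n, k))

z_factorized = np.sum((U.T @ x) * (V.T @ y))  # 1^T (U^T x ∘ V^T y)
z_bilinear = x @ (U @ V.T) @ y                # x^T W y with W = U V^T
assert np.allclose(z_factorized, z_bilinear)
```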
MFH extends this by cascading $p$ such MFB blocks, enabling the model to encode up to $(p+1)$-th order interactions between $x$ and $y$. For block $i$, the expanded feature vector is:

- $z_{\mathrm{exp}}^{i} = \mathrm{MFB}_{\mathrm{exp}}^{i}(x, y) = z_{\mathrm{exp}}^{i-1} \circ \mathrm{Dropout}\left( \tilde{U}_i^\top x \circ \tilde{V}_i^\top y \right)$
- $z_{\mathrm{exp}}^{0} = \mathbf{1} \in \mathbb{R}^{ko}$ (an all-ones vector), for $i = 1, \dots, p$
- $z^{i} = \mathrm{SumPool}(z_{\mathrm{exp}}^{i}, k) \in \mathbb{R}^{o}$

The outputs $z^{i}$ of all $p$ blocks are concatenated, yielding the final MFH representation $z = \left[ z^{1}, z^{2}, \dots, z^{p} \right] \in \mathbb{R}^{po}$.
2. Algorithmic Workflow and Pseudocode
The MFH pooling sequence comprises repeated application of three stages:
- Expand: Linear projections of $x$ and $y$ via $\tilde{U}_i$ and $\tilde{V}_i$, followed by Hadamard product and dropout.
- Cascade: Element-wise multiplication with the previous block’s expansion output.
- Squeeze and Normalize: Block-wise sum pooling, followed by element-wise signed square root (power normalization) and $\ell_2$-normalization.
The high-level pooling sequence for MFH$^p$ repeats these three stages once per block. Both normalization steps, the signed square root and the $\ell_2$ normalization, are essential to prevent instability and excessive neuron magnitudes.
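The following is a minimal PyTorch sketch of this workflow. Class and variable names are illustrative (not the authors' reference implementation), and the defaults follow the hyperparameters discussed in Section 3:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFH(nn.Module):
    """Cascaded factorized bilinear pooling; output dimension is p * o."""
    def __init__(self, x_dim, y_dim, k=5, o=1000, p=2, dropout=0.1):
        super().__init__()
        self.k, self.o, self.p = k, o, p
        # One pair of expansion projections (U~_i, V~_i) per MFB block.
        self.x_proj = nn.ModuleList([nn.Linear(x_dim, k * o) for _ in range(p)])
        self.y_proj = nn.ModuleList([nn.Linear(y_dim, k * o) for _ in range(p)])
        self.drop = nn.Dropout(dropout)

    def forward(self, x, y):
        batch = x.size(0)
        z_exp = x.new_ones(batch, self.k * self.o)  # z_exp^0 = all-ones vector
        outputs = []
        for i in range(self.p):
            # Expand: project both modalities, Hadamard product, dropout,
            # then cascade by multiplying with the previous block's expansion.
            z_exp = z_exp * self.drop(self.x_proj[i](x) * self.y_proj[i](y))
            # Squeeze: block-wise sum pooling over windows of size k.
            z = z_exp.view(batch, self.o, self.k).sum(dim=2)
            # Normalize: signed square root, then l2 normalization.
            z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-12)
            z = F.normalize(z, dim=1)
            outputs.append(z)
        return torch.cat(outputs, dim=1)  # shape: (batch, p * o)
```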
3. Hyperparameter Selection and Design Considerations
MFH’s expressivity and efficiency depend critically on three main hyperparameters:
- Factor dimension $k$: Controls the low-rank approximation per bilinear slice. Typical values are small (single digits), with $k = 5$ empirically effective on VQA benchmarks.
- Output subdimension $o$: Determines each block's pooled vector size; the total MFH output has dimension $po$. Recommended values of $o$ are on the order of several hundred to a few thousand (default $o = 1000$).
- Number of blocks $p$: Higher $p$ enables modeling higher-order interactions. Experiments show $p = 2$ (MFH$^2$) achieves a strong balance, with diminishing benefit beyond $p = 2$.

A practical guideline is to constrain $ko$ (the intermediate expansion dimension) to several thousand. Dropout and normalization must be applied at each stage for stable training.
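Reusing the MFH sketch from Section 2, a default configuration under these guidelines might look as follows (the feature dimensions are illustrative):

```python
# k=5, o=1000, p=2 gives an intermediate expansion of k*o = 5000
# and a final fused representation of p*o = 2000 dimensions.
mfh = MFH(x_dim=2048, y_dim=1024, k=5, o=1000, p=2, dropout=0.1)

x = torch.randn(8, 2048)  # e.g., pooled CNN image features
y = torch.randn(8, 1024)  # e.g., LSTM question features
z = mfh(x, y)
print(z.shape)            # torch.Size([8, 2000]) == (batch, p * o)
```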
4. Integration within VQA Architectures
In advanced VQA networks, MFH is integrated into a three-stage architecture:
- Feature Extraction: Image features from a deep convolutional network and question features from an LSTM over word embeddings.
- Co-Attention Module:
- Question self-attention: Attentive reduction of LSTM outputs.
- Image attention: Each spatial image region feature is fused with the question representation via a lightweight MFB block, followed by softmax attention weighting (see the sketch after this list).
- Final Fusion and Answer Prediction:
- The attended image and question features are fused with the main MFH (or MFB for $p = 1$) module to give the joint representation $z$.
- A fully connected layer projects $z$ to logits over the answer vocabulary.
- Loss is computed using the Kullback–Leibler divergence between the normalized answer histograms $a$ and the predicted distributions $\hat{a}$:

  $$\mathcal{L}_{\mathrm{KLD}}(a, \hat{a}) = \sum_{c=1}^{C} a_c \log \frac{a_c}{\hat{a}_c},$$

  where $C$ is the size of the answer vocabulary.
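The image-attention step referenced in the list above can be illustrated with a short sketch reusing the MFH class from Section 2 with $p = 1$ (i.e., a single MFB block). The module name, dimensions, and the single-linear scoring head are assumptions for illustration, not the authors' exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFBImageAttention(nn.Module):
    """Softmax attention over image regions, scored by MFB fusion with q."""
    def __init__(self, img_dim, q_dim, k=5, o=500):
        super().__init__()
        self.mfb = MFH(x_dim=img_dim, y_dim=q_dim, k=k, o=o, p=1)  # MFB block
        self.att = nn.Linear(o, 1)  # one attention logit per region

    def forward(self, regions, q):
        # regions: (batch, num_regions, img_dim); q: (batch, q_dim)
        b, r, d = regions.shape
        q_tiled = q.unsqueeze(1).expand(b, r, q.size(1)).reshape(b * r, -1)
        fused = self.mfb(regions.reshape(b * r, d), q_tiled).view(b, r, -1)
        weights = F.softmax(self.att(fused), dim=1)  # normalize over regions
        return (weights * regions).sum(dim=1)        # attended image feature
```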
This KLD objective leverages the multi-annotation nature of VQA labeling and achieves faster convergence and slightly higher accuracy than single-label cross-entropy or randomized answer sampling.
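In implementation terms, this objective maps directly onto a standard KL-divergence call; a minimal sketch (the batch size and answer-vocabulary size are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 3000)  # predicted logits over a 3000-answer vocabulary
hist = torch.rand(8, 3000)
hist = hist / hist.sum(dim=1, keepdim=True)  # normalized answer histograms a

# F.kl_div expects log-probabilities as input and probabilities as target.
loss = F.kl_div(F.log_softmax(logits, dim=1), hist, reduction="batchmean")
```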
5. Empirical Evaluation and Comparative Analysis
MFH has been evaluated across major VQA benchmarks:
- VQA-1.0 (test-dev, Open-Ended "All" accuracy):
  | Fusion Method                 | Accuracy (%) |
  |-------------------------------|--------------|
  | Concat/Sum/Prod               | 57–58        |
  | Multimodal Compact Bilinear   | 59.8         |
  | Multimodal Low-Rank Bilinear  | 59.7         |
  | MFB                           | 60.9         |
  | MFH$^2$                       | 61.6         |
  | MFH$^3$                       | 61.5         |
  | MFB+CoAtt                     | 64.6         |
  | MFH+CoAtt                     | 65.8         |
  | MFH+CoAtt+GloVe+VG            | 67.7         |
  | 7x MFH Ensemble               | 69.2         |
- VQA-2.0 (test-dev):
  | Fusion Method      | Accuracy (%) |
  |--------------------|--------------|
  | MFB+CoAtt+GloVe    | 64.98        |
  | MFH+CoAtt+GloVe    | 65.80        |
  | 9x MFH Ensemble    | 68.02        |
These results demonstrate consistent improvement of MFH over first-order (concatenation, sum, product), Multimodal Compact Bilinear, and Multimodal Low-Rank Bilinear pooling baselines. Co-attention integration further boosts performance. MFH achieved new state-of-the-art performance on VQA-1.0 and VQA-2.0 and was runner-up in VQA Challenge 2017 (Yu et al., 2017).
Ablation studies show that both power normalization and $\ell_2$ normalization are crucial; omitting them leads to unstable training and 2–3% accuracy drops. The answer-distribution KLD loss also yields materially faster convergence.
6. Context, Extensions, and Significance
MFH situates itself as a generalization of bilinear pooling, specifically through low-rank factorization and block-wise cascading. When $k = 1$, an MFB block reduces to MLB (Multimodal Low-Rank Bilinear) pooling; for $p = 1$, MFH is equivalent to MFB. The cascaded construction enables higher-order feature interactions without the exponential parameter cost of naïve outer-product expansion.
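To make the first reduction explicit under the notation of Section 1: with $k = 1$, the SumPool window is trivial, so the MFB output collapses to

$$z = \mathrm{SumPool}\left(\tilde{U}^\top x \circ \tilde{V}^\top y,\, 1\right) = \tilde{U}^\top x \circ \tilde{V}^\top y, \qquad \tilde{U} \in \mathbb{R}^{m \times o},\ \tilde{V} \in \mathbb{R}^{n \times o},$$

which matches the Hadamard-product form of MLB (up to MLB's additional nonlinear activation).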
The underlying architectural choices—including co-attention, normalization, and KLD-based label smoothing—have demonstrably positive effects on model convergence and generalization. MFH's parameterization is modular, permitting adaptation to different fusion depths, dimensions, and downstream tasks involving multimodal inputs.
A plausible implication is that further extension of MFH to alternative modalities (e.g., speech-text, video-text) or higher-order reasoning tasks could benefit from the same architectural principles, as long as careful regulation of parameter count and normalization is maintained.
7. Summary of Key Properties and Findings
MFH achieves effective multimodal feature fusion by:
- Employing low-rank factorized bilinear pooling within each block to manage computational complexity.
- Cascading multiple such blocks to capture high-order (up to $(p+1)$-th) interactions, providing richer cross-modal representations.
- Integrating fully within state-of-the-art VQA pipelines, leveraging co-attention mechanisms for both vision and language.
- Demonstrating empirically validated improvements over previous bilinear and low-rank models.
- Requiring, for stable and accurate training, power and $\ell_2$ normalization at each pooling stage and a Kullback–Leibler loss for answer prediction.
These properties substantiate MFH’s role as an effective, scalable, and theoretically motivated multimodal fusion technique in challenging real-world tasks such as VQA (Yu et al., 2017).