Factorized Multimodal Transformer (FMT)
- The paper introduces a factorized attention mechanism that decomposes modality interactions into all nonempty subsets for fine-grained modeling.
- It employs dedicated self-attention channels and lightweight CNN summarization to achieve state-of-the-art performance on sentiment, emotion, and speaker trait datasets.
- Ablation studies show that removing any factor category (unimodal, bimodal, or trimodal) drops binary accuracy by 1–2%, underlining the central role of semantic factorization.
The Factorized Multimodal Transformer (FMT) is a neural architecture designed for multimodal sequential learning, explicitly modeling comprehensive intra- and inter-modal dynamics across temporal sequences. FMT is distinguished by a factorization strategy that decomposes the attention space into all nonempty subsets (“factors”) of the input modalities, allowing for fine-grained and semantically targeted modeling of modality interactions at multiple interaction scales and facilitating robust learning even in low-resource settings (Zadeh et al., 2019). Subsequent efforts have generalized or specialized the factorized attention paradigm, such as in action recognition and autonomous driving contexts (Yang et al., 2023, Zheng et al., 15 Aug 2025).
1. Architectural Principles and Factorization of Modal Interactions
FMT is structured around the explicit enumeration of all nonempty subsets, or “factors,” of the modality set (typically $M = \{L, V, A\}$: language, vision, acoustic). For each factor, a dedicated self-attention channel is applied over the subspace corresponding to the modalities in that factor, enabling asynchronous modeling of unimodal, bimodal, and trimodal dependencies across the entire temporal sequence.
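The factor enumeration described above can be sketched in a few lines. This is a minimal illustration, not code from the paper; the function name `enumerate_factors` and the single-letter modality labels are assumptions made here.

```python
from itertools import combinations

def enumerate_factors(modalities):
    """Return every nonempty subset of `modalities`, smallest subsets first."""
    factors = []
    for size in range(1, len(modalities) + 1):
        # combinations() preserves input order, so the three-modality case
        # yields L, V, A, then LV, LA, VA, then LVA.
        factors.extend(combinations(modalities, size))
    return factors

factors = enumerate_factors(("L", "V", "A"))
factor_names = ["".join(f) for f in factors]
print(factor_names)  # ['L', 'V', 'A', 'LV', 'LA', 'VA', 'LVA']
```

Each of the seven entries would receive its own self-attention channel in the three-modality setting.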
Input embedding is performed per modality using modality-specific linear projections followed by positional encoding. Resulting embeddings are aligned to a common clock and concatenated to form the initial sequence input $X$.
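The embedding step can be sketched with NumPy as follows. Several details are assumptions made for illustration and are not stated in the source: the positional encoding is taken to be the standard sinusoidal form, concatenation is done along the feature axis, and the feature widths in `dims` are placeholder values. The names `sinusoidal_positions` and `embed_modalities` are hypothetical.

```python
import numpy as np

def sinusoidal_positions(T, d):
    """Sinusoidal positional encoding of shape (T, d); d must be even."""
    pos = np.arange(T)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def embed_modalities(inputs, weights, d):
    """Project each time-aligned modality to width d, add positions, concatenate."""
    parts = []
    for name in inputs:
        x = inputs[name] @ weights[name]       # (T, d_in) @ (d_in, d) -> (T, d)
        parts.append(x + sinusoidal_positions(x.shape[0], d))
    # Assumption: modalities are concatenated along the feature axis.
    return np.concatenate(parts, axis=-1)      # (T, d * number of modalities)

rng = np.random.default_rng(0)
T, d = 20, 8
dims = {"L": 300, "V": 35, "A": 74}            # placeholder feature widths
inputs = {m: rng.normal(size=(T, k)) for m, k in dims.items()}
weights = {m: rng.normal(size=(k, d)) * 0.1 for m, k in dims.items()}
X = embed_modalities(inputs, weights, d)
print(X.shape)  # (20, 24)
```

Keeping one contiguous block of columns per modality makes it straightforward to select the columns that belong to a given factor later on.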
The transformer core consists of stacked Multimodal Transformer Layers (MTL). Each MTL contains $U$ parallel Factorized Multimodal Self-attention (FMS) units, each applying attention for all factors in parallel (seven in the three-modality case: $L$, $V$, $A$, $LV$, $LA$, $VA$, $LVA$). The output for each factor is residualized and normalized, then aggregated via a lightweight 1D convolutional summarization network ($\mathcal{S}_1$). The $U$ parallel FMS units capture distinct semantic patterns, and a second summarizer ($\mathcal{S}_2$) collapses across units to yield the MTL output (Zadeh et al., 2019).
2. Attention Mechanisms and Mathematical Formalization
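For each factor $f$, the FMS layer projects the input sequence onto the subspace corresponding to $f$, yielding $X_f$. Factor-specific attention parameters $W_f^Q$, $W_f^K$, $W_f^V$ yield
$$Q_f = X_f W_f^Q, \qquad K_f = X_f W_f^K, \qquad V_f = X_f W_f^V.$$
Scaled dot-product attention is computed as
$$A_f = \mathrm{softmax}\!\left(\frac{Q_f K_f^\top}{\sqrt{d_k}}\right) V_f,$$
where $d_k$ is the key dimension and $V_f$ denotes the value matrix (not the vision factor), with a residual and layer normalization: $H_f = \mathrm{LayerNorm}(X_f + A_f)$. Outputs from all factors are stacked and reduced by the first summarization network $\mathcal{S}_1$ to $Z_u$, the output of the $u$-th FMS unit. The $U$ FMS units are summarized by the second summarization network $\mathcal{S}_2$ as
$$Y = \mathcal{S}_2\big([\tilde{Z}_1; \dots; \tilde{Z}_U]\big),$$
where each $\tilde{Z}_u$ is the result of feed-forward, residual, and normalization on the $u$-th FMS output. The final output after $N$ layers is processed via a unidirectional GRU, whose last hidden state is linearly mapped to the prediction space.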
Each attention head, for every factor, has full temporal receptive field, permitting modeling of long-range and asynchronous dependencies. This factorization distinguishes FMT: the heads are not split along dimension, but along semantic decomposition of modalities (Zadeh et al., 2019).
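A single FMS unit, as described in this section, might look like the NumPy sketch below. It is an illustration under stated assumptions rather than the authors' implementation: projection onto a factor's subspace is modeled by zeroing the feature columns of modalities outside the factor, and the convolutional summarizer is reduced to a weighted sum across the factor axis (equivalent to a 1x1 convolution). All names (`fms_unit`, `slices`, `params`, `s1_weights`) are hypothetical.

```python
import numpy as np
from itertools import combinations

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fms_unit(X, slices, params, s1_weights):
    """One factorized self-attention unit over input X of shape (T, D)."""
    names = list(slices)
    factors = [f for k in range(1, len(names) + 1) for f in combinations(names, k)]
    outputs = []
    for f in factors:
        # Assumption: keep only the columns of the modalities in this factor.
        mask = np.zeros(X.shape[1])
        for m in f:
            mask[slices[m]] = 1.0
        Xf = X * mask
        Wq, Wk, Wv = params[f]
        Q, K, V = Xf @ Wq, Xf @ Wk, Xf @ Wv
        # Every time step attends to every other: full temporal receptive field.
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
        outputs.append(layer_norm(Xf + A))       # residual + layer normalization
    H = np.stack(outputs)                        # (num_factors, T, D)
    # Assumption: the summarizer is a weighted sum over the factor axis.
    return np.tensordot(s1_weights, H, axes=1)   # (T, D)

rng = np.random.default_rng(0)
T, d = 6, 4
slices = {"L": slice(0, 4), "V": slice(4, 8), "A": slice(8, 12)}
D = 3 * d
factors = [f for k in range(1, 4) for f in combinations(list(slices), k)]
params = {f: (rng.normal(size=(D, d)), rng.normal(size=(D, d)),
              rng.normal(size=(D, D))) for f in factors}
s1 = np.full(len(factors), 1.0 / len(factors))
Z = fms_unit(rng.normal(size=(T, D)), slices, params, s1)
print(Z.shape)  # (6, 12)
```

Note that the split here is by modality subset rather than by slicing the feature dimension into equal-width heads, which is the distinction the paragraph above draws.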
3. Optimization, Training Protocols, and Parameter Efficiency
The FMT design incorporates modality-specific features (GloVe/P2FA [language], Emotient FACET [vision], COVAREP [acoustic]) and standard sequence alignment. Training uses Adam optimization over several candidate learning rates, batch size 20, and up to 200 epochs with early stopping. Dropout is applied at key points. Lightweight summarization CNNs ($\mathcal{S}_1$, $\mathcal{S}_2$) regularize the factor/unit aggregations.
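The epoch budget with early stopping can be sketched generically as below. The source gives only the 200-epoch cap; the `patience` rule, the callback `run_epoch`, and the loss values are assumptions introduced for illustration.

```python
def train_with_early_stopping(run_epoch, max_epochs=200, patience=10):
    """Run up to `max_epochs`, stopping after `patience` epochs without improvement.

    `run_epoch(epoch)` is a hypothetical callback that trains for one epoch
    and returns the validation loss.
    """
    best, best_epoch, waited = float("inf"), -1, 0
    for epoch in range(max_epochs):
        val_loss = run_epoch(epoch)
        if val_loss < best:
            best, best_epoch, waited = val_loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best

# Synthetic validation curve standing in for a real training run.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9, 0.95]
result = train_with_early_stopping(lambda e: losses[e], max_epochs=len(losses), patience=3)
print(result)  # (2, 0.7)
```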
Parameter sharing is achieved by representing all factor interactions within a single model stack, in contrast to models such as MulT which instantiate multiple separate transformer streams (Zadeh et al., 2019). Empirical analysis establishes that FMT achieves its robust generalization and high accuracy without an explosion in parameter count, leveraging semantically targeted attention rather than simply increasing the number of heads.
4. Empirical Results and Comparative Evaluation
FMT establishes state-of-the-art results on three well-studied datasets and 21 label types:
- CMU-MOSI (Sentiment Analysis): FMT achieves BA=81.5%/83.5% (neg/non-neg; neg/pos), F1=81.4/83.5, MAE=0.837, Corr=0.744. Against MulT’s best BA=83.0, F1=82.8, MAE=0.87, Corr=0.698, FMT shows improvements in F1 and correlation (Zadeh et al., 2019).
- IEMOCAP (Emotion Recognition): For “Sad,” FMT: BA=88.0 vs. MulT: 86.7; “Angry,” FMT: 89.7 vs. MulT: 87.4; “Neutral,” FMT: 74.0 vs. 72.4.
- POM (Speaker Traits): FMT outperforms MulT across all 16 traits. E.g., "Confident": MA7=40.9% vs. 34.5%; "Passionate": 42.4% vs. 34.5%; "Humorous": 48.3% vs. 43.3%.
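The sentiment metrics quoted for CMU-MOSI can be computed along the following lines. This is a sketch under one common reading of the two binary-accuracy conventions (negative vs. non-negative keeps all samples; negative vs. positive drops samples whose true score is exactly zero); the function names and the toy scores are illustrative and not taken from the paper.

```python
import math

def binary_accuracy(y_true, y_pred, exclude_zero=False):
    """Binary accuracy on signed sentiment scores.

    exclude_zero=False: negative vs. non-negative over all samples.
    exclude_zero=True:  negative vs. positive, dropping true scores equal to 0.
    """
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if not (exclude_zero and t == 0)]
    if exclude_zero:
        hits = sum((t > 0) == (p > 0) for t, p in pairs)
    else:
        hits = sum((t >= 0) == (p >= 0) for t, p in pairs)
    return hits / len(pairs)

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson(y_true, y_pred):
    """Pearson correlation coefficient."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in y_true))
    sp = math.sqrt(sum((p - mp) ** 2 for p in y_pred))
    return cov / (st * sp)

y_true = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_pred = [-1.5, 0.5, -0.5, 1.0, 2.5]
print(binary_accuracy(y_true, y_pred))                      # 0.6
print(binary_accuracy(y_true, y_pred, exclude_zero=True))   # 0.75
print(mae(y_true, y_pred))
print(pearson(y_true, y_pred))
```

The two binary accuracies differ because the zero-labeled sample is counted as an error in the first convention and excluded in the second.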
Ablation studies demonstrate that removing any factor category (unimodal, bimodal, trimodal) leads to performance drops of 1–2% BA, and substituting summarization with naive reductions loses approximately 1% BA. FMT with a single FMS unit (U=1, 7 heads) outperforms a standard Transformer with up to 35 heads (full temporal field) by >5% BA, confirming semantic factorization as central (Zadeh et al., 2019).
5. Generalization, Scalability, and Related Models
Scalability: The number of factors grows exponentially with the number of modalities ($2^{|M|} - 1$ for a modality set $M$), presenting computational challenges for applications beyond three modalities. Practical mitigations include domain-informed pruning of weak factors and greedy or stepwise factor selection.
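The growth in factor count, and the effect of one simple pruning rule, can be checked numerically. Capping the interaction order (keeping only factors with at most `max_order` modalities) is a hypothetical rule chosen here for illustration; the source does not prescribe a specific pruning scheme.

```python
from math import comb

def num_factors(n_modalities, max_order=None):
    """Count nonempty modality subsets, optionally capped at `max_order` members."""
    top = n_modalities if max_order is None else min(max_order, n_modalities)
    return sum(comb(n_modalities, k) for k in range(1, top + 1))

for n in (3, 5, 8):
    # Columns: modalities, all factors (2^n - 1), factors of order <= 2.
    print(n, num_factors(n), num_factors(n, max_order=2))
# 3 7 6
# 5 31 15
# 8 255 36
```

With eight modalities, restricting to unimodal and bimodal factors reduces the count from 255 to 36, at the cost of dropping all higher-order interactions.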
Generalization: FMT is robust under low-resource conditions, attributed to its capacity control through factorized and summarized attention, permitting effective learning from datasets ranging from ~2,000 to 5,000 samples (Zadeh et al., 2019).
Parameter Efficiency: FMT’s factorized structure attains parameter savings versus architectures that implement parallel unimodal and cross-modal transformer stacks. Analogous factorized paradigms have been developed subsequently. The Unified Contrastive Fusion Transformer (UCFFormer) utilizes Factorized Time–Modality Attention for modality-and-time factorization, yielding substantial FLOP and parameter reductions over full joint attention, with no accuracy loss (Yang et al., 2023). In the autonomous driving domain, MultiPark achieves explicit factorization over gear, longitudinal, and lateral parking behaviors for efficient multimodal path prediction, outperforming non-factorized baselines in both accuracy and speed (Zheng et al., 15 Aug 2025).
| Model | Main Factorization Strategy | Parameter Efficiency | Representative Domains |
|---|---|---|---|
| FMT | Explicit (all subsets of modalities) | High | Multimodal sequential learning |
| UCFFormer | Time–Modality axis | High | Human action recognition |
| MultiPark | Parking behavior (gear/lon/lat modes) | High | Autonomous vehicle planning |
6. Limitations and Future Directions
Several open challenges remain. The exponential growth of factors for large $|M|$ (the number of modalities) motivates automated factor pruning, dynamic gating per example, or cross-layer/hierarchical strategies. Integration with pretrained multimodal representations (e.g., vision–language transformers), extension to online or more fine-grained alignment (frame-level), and exploitation of outcome-oriented objectives are cited as active research areas. The principled application of factorization, which targets semantic intersections of modalities, remains central to recent advances in multimodal modeling, as evidenced by extensions in action recognition and control domains (Zadeh et al., 2019, Yang et al., 2023, Zheng et al., 15 Aug 2025).