
Factorized Multimodal Transformer (FMT)

Updated 2 May 2026
  • The paper introduces a factorized attention mechanism that decomposes modality interactions into all nonempty subsets for fine-grained modeling.
  • It employs dedicated self-attention channels and lightweight CNN summarization to achieve state-of-the-art performance on sentiment, emotion, and speaker trait datasets.
  • Ablation studies show that removing any factor category (unimodal, bimodal, or trimodal) drops binary accuracy by 1–2%, underlining the central role of semantic factorization.

The Factorized Multimodal Transformer (FMT) is a neural architecture for multimodal sequential learning that explicitly models intra- and inter-modal dynamics across temporal sequences. FMT is distinguished by a factorization strategy that decomposes the attention space into all nonempty subsets (“factors”) of the input modalities. This allows fine-grained, semantically targeted modeling of modality interactions at multiple interaction scales and supports robust learning even in low-resource settings (Zadeh et al., 2019). Subsequent work has generalized or specialized the factorized attention paradigm, for example in action recognition and autonomous driving (Yang et al., 2023; Zheng et al., 2025).

1. Architectural Principles and Factorization of Modal Interactions

FMT is structured around the explicit enumeration of all nonempty subsets, or “factors,” of the modality set (typically $\{L, V, A\}$: language, vision, acoustic). For each factor, a dedicated self-attention channel is applied over the subspace corresponding to the modalities in that factor, enabling asynchronous modeling of unimodal, bimodal, and trimodal dependencies across the entire temporal sequence.
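The enumeration of factors can be illustrated with a short sketch. The function name `enumerate_factors` and the string labels are illustrative choices, not identifiers from the paper.

```python
from itertools import combinations


def enumerate_factors(modalities):
    """Return every nonempty subset of `modalities`, ordered by subset size."""
    factors = []
    for size in range(1, len(modalities) + 1):
        for subset in combinations(modalities, size):
            # Join the modality labels so that ("L", "V") becomes "LV".
            factors.append("".join(subset))
    return factors


factors = enumerate_factors(["L", "V", "A"])
print(factors)  # ['L', 'V', 'A', 'LV', 'LA', 'VA', 'LVA']
```

With three modalities this gives three unimodal, three bimodal, and one trimodal factor, seven in total.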

Input embedding is performed per modality using modality-specific linear projections followed by positional encoding. The resulting embeddings $\hat m_{(t,i)} = E_M(m_{(t,i)})$ are aligned to a common clock and concatenated to form the initial sequence input $\hat x_i^0 \in \mathbb{R}^{T \times e_x}$.
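A minimal sketch of this embedding step follows, in pure Python for readability. It assumes the modalities are already aligned to $T$ common time steps and uses a sinusoidal positional encoding; the source only says "positional encoding", so that choice, the helper names, and the toy dimensions are all assumptions.

```python
import math
import random


def linear(vec, weight):
    """Multiply a length-d_in vector by a d_in x d_out weight matrix."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weight)]


def positional_encoding(t, dim):
    """Sinusoidal encoding for time step t (an assumed choice of encoding)."""
    return [
        math.sin(t / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(t / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def embed_sequence(streams, weights):
    """streams: {modality: T x d_in list}; weights: {modality: d_in x d_out}.

    Returns a T x e_x list, where e_x is the sum of the per-modality d_out.
    """
    T = len(next(iter(streams.values())))
    out = []
    for t in range(T):
        row = []
        for name, seq in streams.items():
            projected = linear(seq[t], weights[name])
            pe = positional_encoding(t, len(projected))
            # Add the position signal, then concatenate across modalities.
            row.extend(p + e for p, e in zip(projected, pe))
        out.append(row)
    return out


rng = random.Random(0)
T = 4
dims = {"L": (5, 3), "V": (4, 2), "A": (6, 2)}  # hypothetical (d_in, d_out)
streams = {m: [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(T)]
           for m, (d_in, _) in dims.items()}
weights = {m: [[rng.gauss(0, 0.1) for _ in range(d_out)] for _ in range(d_in)]
           for m, (d_in, d_out) in dims.items()}
x0 = embed_sequence(streams, weights)
print(len(x0), len(x0[0]))  # 4 7
```

Here $e_x = 3 + 2 + 2 = 7$, so the concatenated input has shape $T \times e_x = 4 \times 7$.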

The transformer core consists of $K$ stacked Multimodal Transformer Layers (MTL). Each MTL contains $U$ parallel Factorized Multimodal Self-attention (FMS) units, each applying attention for all factors in parallel (seven in the three-modality case: $L$, $V$, $A$, $LV$, $LA$, $VA$, $LVA$). The output for each factor is residualized and normalized, then aggregated via a lightweight 1D convolutional summarization network ($S_1$). The $U$ parallel FMS units capture distinct semantic patterns, and a second summarizer ($S_2$) collapses across units to yield the MTL output (Zadeh et al., 2019).

2. Attention Mechanisms and Mathematical Formalization

For each factor $f$, the FMS layer projects the input sequence onto the subspace corresponding to $f$, yielding $X_f$. Factor-specific attention parameters $W_f^Q, W_f^K, W_f^V$ yield

$Q_f = X_f W_f^Q, \quad K_f = X_f W_f^K, \quad V_f = X_f W_f^V$

where $V_f$ denotes the value matrix for factor $f$, not the vision modality. Scaled dot-product attention is computed as

$\mathrm{Attn}_f = \mathrm{softmax}\left(\frac{Q_f K_f^\top}{\sqrt{d_k}}\right) V_f$

with a residual and layer normalization: $H_f = \mathrm{LayerNorm}(X_f + \mathrm{Attn}_f)$. Outputs from all factors are stacked and reduced by $S_1$ to a single sequence representation $h_u$. The $U$ FMS units are summarized by $S_2$ as

$\hat x_i^{k} = S_2([o_1, \ldots, o_U])$

where each $o_u$ is the result of feed-forward, residual, and normalization on the $u$-th FMS output $h_u$. The final output after $K$ layers is processed via a unidirectional GRU, whose last hidden state is linearly mapped to the prediction space.

Each attention head, for every factor, has full temporal receptive field, permitting modeling of long-range and asynchronous dependencies. This factorization distinguishes FMT: the heads are not split along dimension, but along semantic decomposition of modalities (Zadeh et al., 2019).
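The per-factor attention described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the published implementation: it omits the residual connection, layer normalization, and feed-forward sublayers, and it replaces the CNN summarizer with a plain mean over factors. All function names, the column layout in `slices`, and the toy dimensions are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over the full sequence x (T x d)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(k[0])
    scores = matmul(q, [list(col) for col in zip(*k)])  # T x T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def factor_view(x, slices, factor):
    """Keep only the columns of x that belong to the modalities in `factor`."""
    cols = [c for m in factor for c in range(*slices[m])]
    return [[row[c] for c in cols] for row in x]


def factorized_attention(x, slices, factors, params):
    """Run one attention channel per factor and average the results.

    The mean is a stand-in for the CNN summarizer, not the published design.
    """
    outputs = [self_attention(factor_view(x, slices, f), *params[f]) for f in factors]
    T, d = len(x), len(outputs[0][0])
    return [[sum(o[t][j] for o in outputs) / len(outputs) for j in range(d)]
            for t in range(T)]


rng = random.Random(0)


def rand(rows, cols):
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]


slices = {"L": (0, 3), "V": (3, 5), "A": (5, 7)}  # assumed column ranges
factors = ["L", "V", "A", "LV", "LA", "VA", "LVA"]
d_out = 4
x = rand(4, 7)  # T = 4 time steps, e_x = 7 features
params = {
    f: tuple(rand(sum(slices[m][1] - slices[m][0] for m in f), d_out) for _ in range(3))
    for f in factors
}
out = factorized_attention(x, slices, factors, params)
print(len(out), len(out[0]))  # 4 4
```

Each channel attends over all $T$ time steps, so the split between channels is by modality subset rather than by slicing the feature dimension.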

3. Optimization, Training Protocols, and Parameter Efficiency

The FMT design incorporates modality-specific features (GloVe/P2FA for language, Emotient FACET for vision, COVAREP for acoustic) and standard sequence alignment. Training uses Adam optimization over a range of learning rates, batch size 20, and up to 200 epochs with early stopping. Dropout is applied at key points over a range of rates. The lightweight summarization CNNs ($S_1$, $S_2$) regularize the factor and unit aggregations.
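The epoch cap and early stopping can be sketched as a small training-loop wrapper. The 200-epoch limit comes from the text above; the `patience` value, the function name, and the toy validation curve are hypothetical, since the source does not state the stopping criterion.

```python
def train_with_early_stopping(run_epoch, max_epochs=200, patience=10):
    """Call run_epoch(epoch) -> validation loss until no improvement is seen.

    Stops once `patience` epochs pass without a new best loss; `patience`
    is a hypothetical setting, not a value reported in the source.
    """
    best_loss, best_epoch = float("inf"), -1
    for epoch in range(max_epochs):
        val_loss = run_epoch(epoch)
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_loss


# Toy validation curve standing in for a real training run.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
best = train_with_early_stopping(
    lambda e: losses[min(e, len(losses) - 1)], patience=3
)
print(best)  # (2, 0.7)
```

In a real run, `run_epoch` would train on mini-batches of 20 samples and return the validation loss for that epoch.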

Parameter sharing is achieved by representing all factor interactions within a single model stack, in contrast to models such as MulT which instantiate multiple separate transformer streams (Zadeh et al., 2019). Empirical analysis establishes that FMT achieves its robust generalization and high accuracy without an explosion in parameter count, leveraging semantically targeted attention rather than simply increasing the number of heads.

4. Empirical Results and Comparative Evaluation

FMT establishes state-of-the-art results on three well-studied datasets and 21 label types:

  • CMU-MOSI (Sentiment Analysis): FMT achieves BA=81.5%/83.5% (neg/non-neg; neg/pos), F1=81.4/83.5, MAE=0.837, Corr=0.744. Against MulT’s best BA=83.0, F1=82.8, MAE=0.87, Corr=0.698, FMT shows improvements in F1 and correlation (Zadeh et al., 2019).
  • IEMOCAP (Emotion Recognition): For “Sad,” FMT: BA=88.0 vs. MulT: 86.7; “Angry,” FMT: 89.7 vs. MulT: 87.4; “Neutral,” FMT: 74.0 vs. 72.4.
  • POM (Speaker Traits): FMT outperforms MulT across all 16 traits. E.g., "Confident": MA7=40.9% vs. 34.5%; "Passionate": 42.4% vs. 34.5%; "Humorous": 48.3% vs. 43.3%.

Ablation studies demonstrate that removing any factor category (unimodal, bimodal, trimodal) leads to performance drops of 1–2% BA, and substituting summarization with naive reductions loses approximately 1% BA. FMT with a single FMS unit (U=1, 7 heads) outperforms a standard Transformer with up to 35 heads (full temporal field) by >5% BA, confirming semantic factorization as central (Zadeh et al., 2019).

Scalability: The number of factors grows exponentially with the number of modalities ($2^M - 1$ factors for $M$ modalities), presenting computational challenges for applications beyond three modalities. Practical mitigations include domain-informed pruning of weak factors and greedy or stepwise factor selection.
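The growth in factor count, and one simple way to limit it, can be sketched as follows. The count of nonempty subsets of $M$ modalities is $2^M - 1$. Capping the interaction order, as done here, is only one hypothetical pruning rule used for illustration; the source does not specify how weak factors are chosen.

```python
from itertools import combinations


def num_factors(m):
    """Number of nonempty subsets of m modalities."""
    return 2 ** m - 1


def pruned_factors(modalities, max_order):
    """Keep only factors with at most `max_order` modalities.

    An order cap is an assumed pruning heuristic, not the paper's method.
    """
    return [
        subset
        for size in range(1, max_order + 1)
        for subset in combinations(modalities, size)
    ]


for m in range(1, 7):
    print(m, num_factors(m))  # 1 1, 2 3, 3 7, 4 15, 5 31, 6 63

modalities = ["m1", "m2", "m3", "m4", "m5"]  # hypothetical five-modality setup
kept = pruned_factors(modalities, max_order=2)
print(len(kept), "of", num_factors(len(modalities)))  # 15 of 31
```

With five modalities, keeping only unimodal and bimodal factors retains 5 + 10 = 15 of the 31 possible attention channels.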

Generalization: FMT is robust under low-resource conditions, attributed to its capacity control through factorized and summarized attention, permitting effective learning from datasets ranging from ~2,000 to 5,000 samples (Zadeh et al., 2019).

Parameter Efficiency: FMT’s factorized structure attains parameter savings versus architectures that implement parallel unimodal and cross-modal transformer stacks. Analogous factorized paradigms have been developed subsequently. The Unified Contrastive Fusion Transformer (UCFFormer) uses Factorized Time–Modality Attention to factorize over time and modality, yielding substantial FLOP and parameter reductions over full joint attention with no accuracy loss (Yang et al., 2023). In the autonomous driving domain, MultiPark explicitly factorizes over gear, longitudinal, and lateral parking behaviors for efficient multimodal path prediction, outperforming non-factorized baselines in both accuracy and speed (Zheng et al., 2025).

Model | Main Factorization Strategy | Parameter Efficiency | Representative Domains
FMT | Explicit (all subsets of modalities) | High | Multimodal sequential learning
UCFFormer | Time–Modality axis | High | Human action recognition
MultiPark | Parking behavior (gear/lon/lat modes) | High | Autonomous vehicle planning

6. Limitations and Future Directions

Several open challenges remain. The exponential growth of factors for a large number of modalities $M$ motivates automated factor pruning, dynamic gating per example, or cross-layer and hierarchical strategies. Integration with pretrained multimodal representations (e.g., vision–language transformers), extension to online or finer-grained (frame-level) alignment, and exploitation of outcome-oriented objectives are cited as active research areas. The principled application of factorization, targeting semantic intersections of modalities, remains central to recent advances in multimodal modeling, as evidenced by extensions in action recognition and control domains (Zadeh et al., 2019; Yang et al., 2023; Zheng et al., 2025).
