Low-rank Multimodal Fusion (LMF)
- LMF is a tensor-based fusion technique that uses low-rank factorizations (CP/Tucker) to efficiently combine multiple modalities.
- It reduces the parameter count drastically compared to full tensor fusion while maintaining the ability to model complex cross-modal interactions.
- LMF is applied in various architectures—from sentiment analysis and VQA to hardware-efficient models—delivering faster training and inference.
Low-rank Multimodal Fusion (LMF) is a class of tensor-based methodologies for constructing expressive yet efficient joint representations from multiple modalities. By leveraging tensor factorization, typically in the form of CANDECOMP/PARAFAC (CP) or Tucker decompositions, LMF circumvents the combinatorial growth of parameters inherent in naïve tensor fusion while preserving the ability to model higher-order multiplicative interactions among modalities. LMF has become foundational across multimodal sentiment analysis, visual question answering, efficient transfer learning, and hardware-efficient architectures.
1. Mathematical Formulation of Low-rank Multimodal Fusion
Full multimodal tensor fusion captures all orders of cross-modal interaction by projecting the outer product of unimodal representations: with $M$ modalities, each represented by $z_m \in \mathbb{R}^{d_m}$, the joint tensor is

$$\mathcal{Z} = z_1 \otimes z_2 \otimes \cdots \otimes z_M.$$

This tensor is projected to a low-dimensional fused representation $h \in \mathbb{R}^{d_h}$ by an order-$(M+1)$ weight tensor $\mathcal{W}$:

$$h = \mathcal{W} \cdot \mathcal{Z}.$$

However, $\mathcal{W}$ contains $d_h \prod_{m=1}^{M} d_m$ parameters, which is prohibitive even for moderate $M$ or $d_m$.

To address this, LMF replaces $\mathcal{W}$ by a sum of $r$ rank-1 tensors (a CP decomposition) parameterized by modality-specific factor matrices $W_m^{(i)} \in \mathbb{R}^{d_h \times d_m}$. Owing to multilinearity, the fused representation can be computed without ever materializing $\mathcal{Z}$ or $\mathcal{W}$:

$$h = \mathcal{W} \cdot \mathcal{Z} = \sum_{i=1}^{r} \left( W_1^{(i)} z_1 \right) \circ \left( W_2^{(i)} z_2 \right) \circ \cdots \circ \left( W_M^{(i)} z_M \right),$$

where $\circ$ denotes element-wise (Hadamard) multiplication. This avoids explicit construction of high-order tensors and reduces the parameter count from $d_h \prod_m d_m$ to $r\, d_h \sum_{m=1}^{M} d_m$.
Classical implementations augment each input with a constant bias feature, $[z_m; 1]$, so that lower-order (partial) interactions are captured alongside the full $M$-way term (Liu et al., 2018, Sahay et al., 2020). The fused vector $h$ may be further projected by an output layer or used directly.
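The sketch below illustrates this factorized fusion in PyTorch. It is a minimal, illustrative rendering of the CP-factorized formula above, not the reference code of Liu et al. (2018); the class name `LowRankFusion`, the feature dimensions, and the rank are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    """CP-factorized fusion: h = sum_i (W_1^(i) z_1) * ... * (W_M^(i) z_M) (sketch)."""

    def __init__(self, input_dims, output_dim, rank):
        super().__init__()
        self.rank, self.output_dim = rank, output_dim
        # One factor per modality, mapping the bias-augmented input (d_m + 1)
        # to all r rank-1 components at once (r * d_h outputs).
        self.factors = nn.ModuleList(
            [nn.Linear(d + 1, rank * output_dim, bias=False) for d in input_dims]
        )

    def forward(self, inputs):
        fused = None
        for z, factor in zip(inputs, self.factors):
            ones = torch.ones(z.size(0), 1, dtype=z.dtype, device=z.device)
            z_aug = torch.cat([z, ones], dim=-1)             # [z_m; 1] keeps lower-order terms
            proj = factor(z_aug).view(-1, self.rank, self.output_dim)
            fused = proj if fused is None else fused * proj  # Hadamard product across modalities
        return fused.sum(dim=1)                              # sum over the r rank-1 components

# Illustrative dimensions: audio (74-d), visual (35-d), text (300-d) fused to 64-d with rank 4.
fusion = LowRankFusion(input_dims=[74, 35, 300], output_dim=64, rank=4)
h = fusion([torch.randn(8, 74), torch.randn(8, 35), torch.randn(8, 300)])
print(h.shape)  # torch.Size([8, 64])
```

Note that each modality contributes only $r \cdot d_h \cdot (d_m + 1)$ parameters, and the high-order interaction tensor is never formed.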
Extensions include:
- Tucker factorization (as in MUTAN (Ben-Younes et al., 2017)), which uses a small, learnable core tensor and learned factor matrices along each mode.
- Nested or token-level fusions (e.g., in sequence adapters (Guo et al., 12 Dec 2024)), where two CP decompositions factor both feature and sequence axes, allowing token-level cross-modal interactions.
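For comparison, the following sketch shows a Tucker-style bimodal fusion in the spirit of MUTAN, but not its implementation: a small learnable core is contracted with per-mode projections of the two inputs. The class name `TuckerFusion`, the core and output dimensions, and the tanh nonlinearity are illustrative assumptions, and the additional low-rank constraint MUTAN places on the core slices is omitted for brevity.

```python
import torch
import torch.nn as nn

class TuckerFusion(nn.Module):
    """Bimodal Tucker-style fusion (sketch): project each modality, contract with
    a small learnable core, then project to the output space."""

    def __init__(self, dim_q, dim_v, t_q=64, t_v=64, t_o=64, output_dim=128):
        super().__init__()
        self.proj_q = nn.Linear(dim_q, t_q)
        self.proj_v = nn.Linear(dim_v, t_v)
        self.core = nn.Parameter(torch.randn(t_q, t_v, t_o) * 0.01)  # learnable core tensor
        self.proj_out = nn.Linear(t_o, output_dim)

    def forward(self, q, v):
        q_t = torch.tanh(self.proj_q(q))   # (B, t_q)
        v_t = torch.tanh(self.proj_v(v))   # (B, t_v)
        # Contract the core along its first two modes with the projected inputs.
        fused = torch.einsum("bq,bv,qvo->bo", q_t, v_t, self.core)   # (B, t_o)
        return self.proj_out(fused)

q = torch.randn(8, 2400)   # e.g. a question/sentence embedding (assumed size)
v = torch.randn(8, 2048)   # e.g. an image feature (assumed size)
print(TuckerFusion(2400, 2048)(q, v).shape)   # torch.Size([8, 128])
```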
2. Computational and Statistical Efficiency
LMF’s primary advantage is the linear scaling of parameters and computation with respect to the number and size of modalities, versus the exponential scaling of naïve outer-product fusion. A comparative summary:
| Fusion | Parameter Count | Complexity |
|---|---|---|
| Full Tensor | $d_h \prod_{m=1}^{M} d_m$ | exponential in $M$ |
| LMF (CP) | $r\, d_h \sum_{m=1}^{M} d_m$ | linear in $M$ |
| Tucker/MUTAN | core tensor plus per-mode factors | See (Ben-Younes et al., 2017) |
Practical instantiations show drastic savings: for trimodal fusion, the Tensor Fusion Network (TFN) requires orders of magnitude more fusion parameters than LMF at comparable or better accuracy (Liu et al., 2018). Modern sequence-level LMF adapters reduce the trainable parameter budget from the hundreds of millions required for full backbone fine-tuning to well under a million (Guo et al., 12 Dec 2024).
Training and inference are correspondingly accelerated: on CMU-MOSI, LMF trains and performs inference substantially faster than TFN (Liu et al., 2018), and transformer variants require markedly less memory and wall-clock time than full cross-modal attention (Sahay et al., 2020).
Empirically, modest ranks (single-digit, up to about $8$) suffice for strong performance, while higher ranks risk overfitting or diminishing returns (Liu et al., 2018, Guo et al., 12 Dec 2024).
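As a back-of-the-envelope check, the snippet below compares fusion-weight parameter counts under the same illustrative dimensions as the earlier sketch; these are not figures reported in the cited papers.

```python
# Parameter counts for the fusion weights alone, under assumed dimensions
# (three modalities, 64-d fused output, CP rank 4).
d_mods = [74, 35, 300]   # audio, visual, text feature sizes (assumed)
d_h, r = 64, 4           # fused dimension and CP rank

full_tensor = d_h
for d in d_mods:
    full_tensor *= d + 1                     # +1 for bias augmentation
lmf = r * d_h * sum(d + 1 for d in d_mods)   # r * d_h * sum_m (d_m + 1)

print(f"full tensor fusion: {full_tensor:,} parameters")   # ~52 million
print(f"LMF (rank {r}):     {lmf:,} parameters")           # ~105 thousand
```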
3. Architectural Instantiations and Extensions
LMF has been operationalized in diverse architectures:
- Classical Feedforward Fusion: As in (Liu et al., 2018), the fused representation feeds directly into regression or classification layers for tasks such as sentiment analysis or trait prediction.
- Fusion-based Transformers: In (Sahay et al., 2020), LMF is integrated with transformer cross-attention. Modality-specific encoders (LSTMs or transformers) produce context vectors; fusion proceeds via LMF, and the result conditions subsequent cross-modal attention. This approach supports both fusion-driven and modality-driven attention flow.
- Multimodal Adapters: The loW-rank sequence multimodal adapter (“Wander” (Guo et al., 12 Dec 2024)) extends LMF to token-level CP decompositions within adapters inserted into frozen transformer backbones. This generalizes LMF beyond vector fusion to sequence-level cross-modal interactions and achieves large parameter savings relative to full fine-tuning (a simplified sketch follows this list).
- Optical/Energy-efficient LMF: TOMFN (Zhao et al., 2023) introduces the mapping of CP and Tensor-Train (TT) decompositions to cascaded Mach–Zehnder interferometer (MZI) arrays in photonic hardware. All LMF operations—basic projections, Hadamard products, even self-attentive text encoders—are realized optically within multiplexed, low-rank tensor-core meshes.
- VQA Bilinear/Tucker LMF: In MUTAN (Ben-Younes et al., 2017), Tucker decomposition and additional low-rank core constraints efficiently parametrize bilinear question–image fusion and outperform both CP (MLB) and fixed-sketch (MCB) baselines, confirming the regularization and expressivity benefits of structured low-rankness.
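To make the adapter idea concrete, the sketch below shows a heavily simplified token-level low-rank fusion adapter. It factorizes only the feature axis and assumes the two token sequences are already length-aligned; the sequence-axis factorization, bias terms, and other details of Wander (Guo et al., 12 Dec 2024) are omitted, and all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class LowRankFusionAdapter(nn.Module):
    """Token-level low-rank fusion adapter (simplified sketch).

    Applies the CP-style product-of-projections fusion per token between the
    frozen backbone's sequence `x` and a second modality's sequence `y`, then
    adds the result back residually. Only the small factor matrices are trained.
    """

    def __init__(self, d_model, d_other, bottleneck=32, rank=4):
        super().__init__()
        self.rank, self.bottleneck = rank, bottleneck
        self.factor_x = nn.Linear(d_model, rank * bottleneck, bias=False)
        self.factor_y = nn.Linear(d_other, rank * bottleneck, bias=False)
        self.up = nn.Linear(bottleneck, d_model, bias=False)

    def forward(self, x, y):
        # x: (B, T, d_model) from the frozen backbone; y: (B, T, d_other), token-aligned.
        px = self.factor_x(x).view(*x.shape[:2], self.rank, self.bottleneck)
        py = self.factor_y(y).view(*y.shape[:2], self.rank, self.bottleneck)
        fused = (px * py).sum(dim=2)   # CP fusion per token: sum over rank components
        return x + self.up(fused)      # residual keeps the backbone's original signal

adapter = LowRankFusionAdapter(d_model=768, d_other=512)
x = torch.randn(2, 16, 768)   # tokens from a frozen text encoder (assumed)
y = torch.randn(2, 16, 512)   # aligned tokens from another modality (assumed)
print(adapter(x, y).shape)    # torch.Size([2, 16, 768])
```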
4. Empirical Results and Applications
LMF has demonstrated broad competitiveness across diverse multimodal tasks:
- Sentiment & Emotion Analysis: On CMU-MOSI, POM, and IEMOCAP, LMF surpasses or matches full fusion networks (e.g., MAE decreases from $0.970$ to $0.912$; Pearson correlation increases from $0.633$ to $0.668$), with a large reduction in fusion parameters (Liu et al., 2018). In transformer-based models (Sahay et al., 2020), LMF-MulT attains accuracy and F1 within a small margin of much larger cross-modal attention models.
- Visual Question Answering: MUTAN achieves strong accuracy in a single pass without attention and improves further with attention and ensembling, outperforming both random-sketch (MCB) and CP-only (MLB) baselines on VQA test-dev while requiring orders of magnitude fewer parameters (Ben-Younes et al., 2017).
- Transfer Learning: Wander demonstrates parameter savings of $95\%$ or more while matching or exceeding full fine-tuning across 2–7 modality settings (UPMC-Food101, IEMOCAP, MSRVTT); e.g., on CMU-MOSI, trainable parameters drop from 80M to $0.3$–$0.9$M while F1 improves by $0.6$ points (Guo et al., 12 Dec 2024).
- Hardware-efficient LMF: TOMFN reports a substantial reduction in hardware (MZI count) and improved energy efficiency (MAC/J), with competitive macro F1 on IEMOCAP (Zhao et al., 2023).
5. Trade-offs, Limitations, and Variants
The choice of rank (the CP rank $r$ or the Tucker core dimensions) directly trades representation capacity for compactness and regularization. Small $r$ suffices in practice, with very low values already matching or outperforming full fusion networks on core tasks, but excessive rank can degrade training stability or induce overfitting (Liu et al., 2018). Implementation-specific details (the output or bottleneck dimension, placement of bias terms, output projections, or up-projection elimination as in (Guo et al., 12 Dec 2024)) allow further tuning.
Extensions and caveats:
- Nested or token-level low-rank decompositions add complexity but keep inference feasible and drastically reduce memory relative to naïve sequence fusion, with reported reductions of $200\times$ or more (Guo et al., 12 Dec 2024).
- Optical hardware designs hinge on mapping all linear and nonlinear tensor operations to composable photonic primitives—summation, Hadamard products, and unitary projections are realized physically, tied to MZI mesh design (Zhao et al., 2023).
- LMF is inherently versatile: applicable to any number of modalities, adaptable to both vector and sequence-level representations, and combinable with hybrid factorizations (e.g., Tucker, block-sparse) (Ben-Younes et al., 2017, Guo et al., 12 Dec 2024).
- In some architectures, time or sequence structure is only partially recovered; direct time-aware LMF (applying fusion at every time step rather than post-aggregation) remains an area for expansion (Sahay et al., 2020).
- Nested CP decompositions, as in Wander, involve more implementation overhead and modestly higher inference cost, but empirical results suggest these costs are offset by substantial memory and parameter efficiencies (Guo et al., 12 Dec 2024).
6. Relationships to Prior Art and Theoretical Significance
LMF subsumes and generalizes prior tensor-based multimodal fusion schemes:
- CP Decomposition: MLB (Ben-Younes et al., 2017) employs canonical polyadic (CP) decomposition but does not enable the full flexibility of Tucker or parameter sharing.
- Tucker Decomposition: MUTAN (Ben-Younes et al., 2017) generalizes the CP-based MLB formulation via a Tucker factorization, enabling interpretable low-rank slices in the fusion core with an adjustable interaction rank $R$.
- Random Projections: MCB’s fixed random sign-diagonal models are strictly subsumed by learnable Tucker and CP variants.
- Adapters and Sequence-Level Fusion: “Wander” (Guo et al., 12 Dec 2024) extends LMF to token-level cross-modal adapters, unifying low-rank factorization with residual parameter-efficient transfer learning.
All LMF methods share the central premise: expressive cross-modal fusion need not require full tensor expansion. Instead, mathematically principled low-rank factorizations endow models with the necessary expressivity while ensuring strong parameter and memory efficiency, regularizing against overfitting, and, in some cases, enabling new hardware or transfer paradigms.
7. Future Directions and Open Challenges
Promising avenues for further research include:
- Dynamic or adaptive selection of rank per layer or modality (Guo et al., 12 Dec 2024)
- Extension of LMF to settings with many modalities (e.g., vision, language, audio, event, video) and to large-scale multimodal pretraining (Guo et al., 12 Dec 2024)
- Hybrid fusions (combining CP, Tucker, block-sparse or other structured factorizations)
- Learned compression for long-range sequence modalities (video, speech)
- Hardware co-design with photonic or other analog accelerators (Zhao et al., 2023)
- Comprehensive ablation studies over LMF rank, convolutional augmentations, and cross-modal projection variants (not fully reported in early works (Sahay et al., 2020))
As LMF techniques continue to find new application domains and hardware contexts, the balance between compact modeling, statistical expressivity, and implementational feasibility will remain central to multimodal machine learning.