Tensor Fusion Network (TFN)
- Tensor Fusion Networks (TFNs) are multimodal deep learning architectures that use the outer product of modality embeddings to capture explicit unimodal, bimodal, and trimodal interactions.
- TFN integrates modality-specific embedding subnetworks, a parameter-free tensor fusion layer, and an inference network to enhance performance on tasks like sentiment analysis and fraud detection.
- The end-to-end differentiable design of TFN and its superiority over early fusion methods make it a powerful approach for complex multimodal tasks despite computational challenges in high-dimensional settings.
A Tensor Fusion Network (TFN) is a multimodal deep learning architecture that explicitly and deterministically models intra-modal and inter-modal dynamics by using the outer product of modality embeddings, thereby enumerating all unimodal, bimodal, and, when applicable, trimodal interactions. The TFN framework is fully differentiable, supports end-to-end learning, and has been instantiated for tasks in sentiment analysis as well as for real-time multimodal detection systems. Architecturally, TFN consists of modality-specific embedding subnetworks, a parameter-free Tensor Fusion Layer, and a compact post-fusion inference network. Performance ablations demonstrate TFN's superiority over traditional early-fusion concatenation approaches in capturing complex cross-modal interactions and improving supervised learning efficacy (Zadeh et al., 2017, Wauyo et al., 2 Oct 2025).
1. Architectural Foundation
TFN is organized into three principal components: modality embedding subnetworks, the outer product fusion tensor, and the downstream inference subnetwork.
- Modality Embedding Subnetworks: Each input modality is processed by a specialized subnetwork designed to capture intra-modality structures. For the spoken-language modality, this is typically a single-layer LSTM that produces a fixed-dimensional embedding. For the visual and acoustic modalities, deep fully-connected networks are used after temporal pooling on raw frame-level features. In (Zadeh et al., 2017), input sequences are embedded as follows:
- Spoken-Language: $\mathbf{z}^{l} = \mathrm{LSTM}(\text{word-embedding sequence}) \in \mathbb{R}^{d_l}$
- Visual: $\mathbf{z}^{v} = \mathrm{FFN}(\text{temporally pooled visual features}) \in \mathbb{R}^{d_v}$
- Acoustic: $\mathbf{z}^{a} = \mathrm{FFN}(\text{temporally pooled acoustic features}) \in \mathbb{R}^{d_a}$
- Fusion Tensor Construction: The outer product of the modality embedding vectors (each augmented with a 1 for bias) forms a high-order tensor expressing every unimodal, bimodal, and trimodal interaction: $\mathbf{Z} = [\mathbf{z}^{l}; 1] \otimes [\mathbf{z}^{v}; 1] \otimes [\mathbf{z}^{a}; 1]$, yielding $\mathbf{Z} \in \mathbb{R}^{(d_l+1) \times (d_v+1) \times (d_a+1)}$, where each block represents either unimodal, bimodal, or trimodal contributions (Zadeh et al., 2017). In applications restricted to two modalities, such as video and audio, the 2-fold outer product forms $\mathbf{Z} = [\mathbf{z}^{v}; 1] \otimes [\mathbf{z}^{a}; 1] \in \mathbb{R}^{(d_v+1) \times (d_a+1)}$ (Wauyo et al., 2 Oct 2025).
- Inference Subnetwork: The flattened fusion tensor (dimensionality $(d_l+1)(d_v+1)(d_a+1)$ in the trimodal case) is passed through a two- or three-layer feedforward network to produce final predictions. This network is responsible for learning which interactions are informative for the target task.
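As a concrete illustration of this three-part layout, the following PyTorch sketch wires the pieces together; the layer sizes, hidden dimensions, and activation choices are illustrative assumptions, not the published configurations:

```python
import torch
import torch.nn as nn

class TFN(nn.Module):
    """Minimal Tensor Fusion Network sketch: three embedding subnetworks,
    a parameter-free outer-product fusion, and a small inference head."""

    def __init__(self, d_text_in, d_vis_in, d_ac_in,
                 d_l=64, d_v=16, d_a=16, n_out=1):
        super().__init__()
        # Spoken-language subnetwork: single-layer LSTM over word vectors.
        self.lang_lstm = nn.LSTM(d_text_in, d_l, batch_first=True)
        # Visual / acoustic subnetworks: fully connected nets over
        # temporally pooled frame-level features.
        self.vis_net = nn.Sequential(nn.Linear(d_vis_in, d_v), nn.ReLU())
        self.ac_net = nn.Sequential(nn.Linear(d_ac_in, d_a), nn.ReLU())
        # Post-fusion inference network over the flattened fusion tensor.
        fused_dim = (d_l + 1) * (d_v + 1) * (d_a + 1)
        self.inference = nn.Sequential(
            nn.Linear(fused_dim, 128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, n_out),
        )

    def forward(self, text_seq, vis_feats, ac_feats):
        _, (h, _) = self.lang_lstm(text_seq)   # h: (1, B, d_l)
        z_l = h.squeeze(0)
        z_v = self.vis_net(vis_feats)
        z_a = self.ac_net(ac_feats)
        # Append the bias slot (the "1") to each embedding.
        aug = lambda z: torch.cat([z, z.new_ones(z.size(0), 1)], dim=1)
        zl, zv, za = aug(z_l), aug(z_v), aug(z_a)
        # Parameter-free fusion: batched 3-way outer product,
        # shape (B, d_l+1, d_v+1, d_a+1).
        Z = torch.einsum('bi,bj,bk->bijk', zl, zv, za)
        return self.inference(Z.flatten(1))
```

Note that the `einsum` fusion step contributes no learnable parameters; everything trainable lives in the encoders and the inference head, matching the parameter-free Tensor Fusion Layer described above.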
2. Mathematical Formalism of the Tensor Fusion Layer
The Tensor Fusion Layer is the core innovation of TFN. Instead of early fusion (simple concatenation), TFN computes the full outer product of the bias-augmented embeddings:

$$\mathbf{Z} = \begin{bmatrix} \mathbf{z}^{l} \\ 1 \end{bmatrix} \otimes \begin{bmatrix} \mathbf{z}^{v} \\ 1 \end{bmatrix} \otimes \begin{bmatrix} \mathbf{z}^{a} \\ 1 \end{bmatrix} \in \mathbb{R}^{(d_l+1) \times (d_v+1) \times (d_a+1)}$$

This operation enumerates all combinations, incorporating:
- 1D Subtensors: Purely unimodal terms.
- 2D Subtensors: All pairwise cross-modal (bimodal) interactions.
- 3D Subtensor: Full trimodal interaction.
- Bias Slots: 1s appended to each embedding enable inclusion of lower-order interactions (editor's term: "bias-augmented tensor fusion").
The fusion layer itself introduces no learnable parameters, and, because the outer product is used, all cross-modality interactions remain explicit and interpretable. This structure is fully differentiable and allows the subsequent network to attend to any subset of these interactions (Zadeh et al., 2017, Wauyo et al., 2 Oct 2025).
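A minimal NumPy sketch makes this block structure concrete: appending the bias 1 to each embedding means the single outer product contains the unimodal, bimodal, and trimodal terms as recoverable sub-blocks (the embedding sizes below are arbitrary toy values):

```python
import numpy as np

# Toy embeddings for three modalities (sizes are arbitrary).
rng = np.random.default_rng(0)
z_l = rng.standard_normal(3)    # language embedding, d_l = 3
z_v = rng.standard_normal(2)    # visual embedding,   d_v = 2
z_a = rng.standard_normal(2)    # acoustic embedding, d_a = 2
zl = np.append(z_l, 1.0)        # append the bias slot
zv = np.append(z_v, 1.0)
za = np.append(z_a, 1.0)

# Full outer product: shape (d_l+1, d_v+1, d_a+1) = (4, 3, 3).
Z = np.einsum('i,j,k->ijk', zl, zv, za)

# The bias slots carve Z into interpretable blocks:
assert np.allclose(Z[:-1, -1, -1], z_l)                    # 1D: unimodal language
assert np.allclose(Z[:-1, :-1, -1], np.outer(z_l, z_v))    # 2D: language-visual
assert np.allclose(Z[:-1, :-1, :-1],
                   np.einsum('i,j,k->ijk', z_l, z_v, z_a)) # 3D: trimodal block
assert Z[-1, -1, -1] == 1.0                                # constant bias entry
```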
3. Training Procedure and Optimization
TFN is trained end-to-end using stochastic gradient descent variants, with all components differentiable. Training details from representative works include:
- Input and Preprocessing: Modality inputs are preprocessed into segment-level embeddings via domain-specific pipelines (e.g., GloVe embeddings for language, FACET/OpenFace for vision, COVAREP for audio).
- Regularization: Dropout on all hidden layers; weight regularization.
- Optimization: Adam or AdamW optimizers with tuned learning rates (Zadeh et al., 2017), with cosine-annealing schedules in (Wauyo et al., 2 Oct 2025).
- Loss Functions: Depending on task—binary cross-entropy for binary classification, categorical cross-entropy for n-way classification, mean squared error for regression (Zadeh et al., 2017, Wauyo et al., 2 Oct 2025).
- Training Regime: 5-fold cross-validation with speaker independence when appropriate; early stopping on validation loss.
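Putting these ingredients together, a condensed training sketch follows, reusing the hypothetical `TFN` class from the Section 1 sketch; toy random tensors stand in for real segment-level features, and all hyperparameter values are illustrative rather than the published settings:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T = 32, 20
text = torch.randn(B, T, 300)   # stand-in for GloVe-style word vectors
vis = torch.randn(B, 35)        # stand-in for pooled visual features
ac = torch.randn(B, 74)         # stand-in for pooled acoustic features
y = torch.randint(0, 2, (B,)).float()

model = TFN(d_text_in=300, d_vis_in=35, d_ac_in=74)
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.BCEWithLogitsLoss()    # binary task; swap for CE / MSE per task

best_val, patience, bad = float('inf'), 5, 0
for epoch in range(50):
    model.train()
    opt.zero_grad()
    loss = loss_fn(model(text, vis, ac).squeeze(-1), y)
    loss.backward()
    opt.step()
    model.eval()
    with torch.no_grad():           # toy "validation" on the same batch
        val = loss_fn(model(text, vis, ac).squeeze(-1), y).item()
    if val < best_val:
        best_val, bad = val, 0
    else:
        bad += 1
        if bad >= patience:         # early stopping on validation loss
            break
```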
4. Empirical Results and Ablation Analyses
TFN consistently demonstrates superior empirical performance over early-fusion and other traditional fusion strategies.
| Model/Setting | Binary Acc. | 5-class Acc. | MAE | Correlation (r) | Recall (%) | F1 (%) |
|---|---|---|---|---|---|---|
| TFN (trimodal) | 77.1% (Zadeh et al., 2017) | 42.0% | 0.87 | 0.70 | 84.0 (Wauyo et al., 2 Oct 2025) | 85.6 |
| Prior SOTA | 73.1% | 35.3% | 1.10 | 0.53 | 75.2 | 78.6 |
| Unimodal only | ≈75% | ≈38.0% | – | – | 76.4 | 79.8 |
| Bimodal only | – | – | – | – | 80.6 | 83.1 |
Ablation results indicate notable degradation when any modality is removed or when only lower-order interactions are allowed. Removing the trimodal block or using only early fusion also leads to reduced accuracy. In transportation fraud detection, TFN's explicit modeling of cross-modal dynamics resulted in an F1-score improvement of 7.0% and a recall gain of 8.8% over concatenation (Wauyo et al., 2 Oct 2025).
This suggests that the outer-product fusion, despite its high dimensionality, enhances discriminative power by encoding fine-grained inter-modality dependencies.
5. Applications in Multimodal Learning
Originally proposed for multimodal sentiment analysis in online videos (Zadeh et al., 2017), TFN applies to diverse domains requiring detection or classification from heterogeneous data streams:
- Sentiment Analysis: On datasets like CMU-MOSI, TFN establishes new state-of-the-art results for both unimodal and multimodal sentiment inference tasks.
- Multimodal Fraud Detection: In public transport security, TFN fuses vision (ViViT) and audio (AST) embeddings for real-time fare evasion and fraud detection, achieving 89.5% accuracy, 87.2% precision, and 84.0% recall, outperforming previous systems by significant margins (Wauyo et al., 2 Oct 2025).
- General Multimodal Classification: Any task requiring explicit modeling of multimodal signals and their interactions can potentially benefit from TFN's architecture.
6. Computational Considerations and Scalability
The main computational burden introduced by TFN lies in the dimensionality of the fusion tensor, which grows multiplicatively with the embedding sizes of the constituent modalities. For practical applications:
- Memory and Speed: In (Wauyo et al., 2 Oct 2025), the fusion tensor for vision embeddings $\mathbf{z}^{v} \in \mathbb{R}^{d_v}$ and audio embeddings $\mathbf{z}^{a} \in \mathbb{R}^{d_a}$ yields $\mathbf{Z} \in \mathbb{R}^{(d_v+1) \times (d_a+1)}$, requiring only $(d_v+1)(d_a+1)$ multiplies, negligible compared to the backbone network cost. This architecture enables near-edge, real-time inference at 98 ms/sample (about 10 FPS) with a memory footprint of 156 MB.
- End-to-End Differentiability: No additional parameters are introduced by the fusion layer itself; all parameter learning occurs in the modality encoders and the post-fusion network.
- Scalability: In high-modal or high-dimensional settings, the size of the fusion tensor may become a bottleneck, motivating further research into low-rank approximations or structured pruning (not addressed explicitly in the given data).
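The multiplicative growth is easy to make concrete with a small helper; the embedding sizes below are hypothetical, chosen only to show the growth pattern:

```python
def fusion_tensor_size(*dims: int) -> int:
    """Number of entries in the bias-augmented fusion tensor for the
    given per-modality embedding sizes."""
    n = 1
    for d in dims:
        n *= d + 1  # +1 for the appended bias slot
    return n

# Bimodal fusion of modest embeddings stays small ...
print(fusion_tensor_size(32, 32))           # 1089 entries
# ... a trimodal fusion with a larger language embedding grows fast ...
print(fusion_tensor_size(128, 32, 32))      # 140481 entries
# ... and adding a fourth modality multiplies the cost again.
print(fusion_tensor_size(128, 32, 32, 32))  # 4635873 entries
```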
7. Limitations, Interpretations, and Outlook
TFN explicitly models all order-$k$ cross-modality interactions, for $k$ from 1 up to the number of modalities, via outer product expansion. The deterministic, parameter-free nature of the fusion tensor provides interpretability and direct access to unimodal, bimodal, and trimodal effects. Performance ablations consistently demonstrate that higher-order interaction blocks are critical for achieving optimal results. However, as the number of modalities or embedding dimensions increases, the size of the fusion tensor scales exponentially, posing memory and computational challenges.
A plausible implication is that, although TFN offers comprehensive interaction modeling and state-of-the-art results in moderate-scale multimodal tasks, further research into fusion tensor compression or structured regularization may be necessary for very high-dimensional or large-scale applications.
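As one sketch of that compression direction, a rank-$R$ factorized fusion, in the spirit of low-rank multimodal fusion and not a method from the cited papers, avoids materializing the fusion tensor entirely:

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    """Rank-R factorized bimodal fusion: computes a rank-R bilinear map
    over the bias-augmented outer product without materializing the
    (d_v+1) x (d_a+1) tensor. Generic illustrative sketch only."""

    def __init__(self, d_v: int, d_a: int, d_out: int, rank: int = 4):
        super().__init__()
        # One factor tensor per modality; fusion cost is O(R * d * d_out)
        # per modality instead of O(d_v * d_a * d_out) for the full map.
        self.Wv = nn.Parameter(torch.randn(rank, d_v + 1, d_out) * 0.1)
        self.Wa = nn.Parameter(torch.randn(rank, d_a + 1, d_out) * 0.1)

    def forward(self, z_v: torch.Tensor, z_a: torch.Tensor) -> torch.Tensor:
        aug = lambda z: torch.cat([z, z.new_ones(z.size(0), 1)], dim=1)
        zv, za = aug(z_v), aug(z_a)     # bias-augmented embeddings
        # Project each modality through its rank-R factors, combine
        # elementwise, and sum over the rank dimension.
        pv = torch.einsum('bi,rio->bro', zv, self.Wv)
        pa = torch.einsum('bj,rjo->bro', za, self.Wa)
        return (pv * pa).sum(dim=1)     # (B, d_out)
```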
References:
- Zadeh et al. (2017). Tensor Fusion Network for Multimodal Sentiment Analysis.
- Wauyo et al. (2 Oct 2025). Towards fairer public transit: Real-time tensor-based multimodal fare evasion and fraud detection.