Multi-Modal Task-Oriented Communication
- The paper presents a multi-stage framework using per-modality variational bottlenecks and adversarial MI minimization to extract and fuse task-relevant features.
- It employs deep semantic encoding and channel-aware strategies to compress inputs and ensure robust performance under noisy and resource-constrained conditions.
- Empirical evaluations on datasets like CMU-MOSI/MOSEI show notable F1-score improvements and validate its effectiveness against traditional semantic communication models.
A multi-modal task-oriented communication framework is a system that jointly processes heterogeneous input modalities (such as text, voice, video, and audio) for the purpose of transmitting only the information relevant to a specified downstream task (such as emotion recognition, question answering, or control), while operating over resource-constrained and noisy communication channels. The primary objective is to maximize task-relevant accuracy and reliability under stringent rate–distortion–robustness constraints, typically by integrating deep semantic encoding, redundancy-aware representation learning, rigorous cross-modal fusion, and channel-model-aware robustness strategies.
1. Core System Architecture and Redundancy-Aware Fusion
A robust multi-modal task-oriented communication framework operates in a multi-stage fashion. Each modality $m \in \{i, t, a\}$ first undergoes pre-trained feature extraction $\delta^m$, producing features $X^m$, which a uni-modal variational information bottleneck (VIB) encoder $\epsilon^m$ maps to a stochastic latent representation $Z^m$ that preserves task-relevant semantics while compressing the input features. The set $\{Z^i, Z^t, Z^a\}$ is concatenated and passed through a redundancy-minimizing fusion module.
The redundancy suppression is implemented by adversarially minimizing pairwise mutual information (MI) between each modality pair, i.e., $(Z^i, Z^t)$, $(Z^i, Z^a)$, and $(Z^t, Z^a)$. This is accomplished with gradient-reversal layers (GRL) and three discriminators $T_{it}, T_{ia}, T_{ta}$ (one per modality pair). After fusion, a multi-modal VIB (M-VIB) encoder $\eta$ further compresses the fused feature $X$ into $Z$, which is modulated, transmitted over a modeled channel with additive white Gaussian noise (AWGN), and then decoded into the final task prediction $\hat{Y}$ (Fu et al., 10 Nov 2025):
```
S^i → δ^i → X^i → ε^i → Z^i ┐
S^t → δ^t → X^t → ε^t → Z^t ┤→ [concat] → GRL → {T_{it}, T_{ia}, T_{ta}}
S^a → δ^a → X^a → ε^a → Z^a ┘
                ↓
X → η → Z → channel → \hat Z → υ → \hat Y
```
This model architecture explicitly factors in both intra-modal compression (via uni-modal VIB) and inter-modal redundancy minimization, so that only complementary and non-redundant information is retained in the final transmitted representation.
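To make this staged pipeline concrete, the following PyTorch sketch traces the forward pass under simplifying assumptions: the class name, layer choices, feature dimensions, class count, and SNR value are illustrative placeholders rather than the paper's architecture, and the stochastic VIB encoders and the GRL/discriminator branch are elaborated in the sketches under Section 2.

```python
import torch
import torch.nn as nn

class RMTOCSketch(nn.Module):
    """Schematic forward pass (placeholder layers only): per-modality encoders ->
    concat -> M-VIB encoder -> AWGN channel -> task decoder. The GRL/discriminator
    branch is omitted here and sketched separately under Section 2.2."""
    def __init__(self, feat_dims=(512, 768, 128), uni_dim=64, fused_dim=32, n_classes=7):
        super().__init__()
        # per-modality encoders epsilon^m (deterministic stand-ins for the VIB encoders)
        self.uni_encoders = nn.ModuleList(nn.Linear(d, uni_dim) for d in feat_dims)
        self.mvib_encoder = nn.Linear(uni_dim * len(feat_dims), fused_dim)  # eta
        self.decoder = nn.Linear(fused_dim, n_classes)                      # upsilon

    def forward(self, feats, snr_db=0.0):
        z_uni = [enc(x) for enc, x in zip(self.uni_encoders, feats)]  # Z^i, Z^t, Z^a
        z = self.mvib_encoder(torch.cat(z_uni, dim=-1))               # fused Z
        noise_power = z.pow(2).mean() / (10 ** (snr_db / 10))         # AWGN at given SNR
        z_hat = z + torch.randn_like(z) * noise_power.sqrt()          # \hat Z
        return self.decoder(z_hat)                                    # \hat Y

feats = [torch.randn(4, d) for d in (512, 768, 128)]  # pre-extracted X^i, X^t, X^a
logits = RMTOCSketch()(feats, snr_db=-6.0)
```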
2. Mathematical Formulation: Variational Bottleneck and Mutual Information Minimization
2.1 Uni-modal Compression
For each modality $m \in \{i, t, a\}$, the VIB objective trades task relevance against compression,
$$\min_{\epsilon^m} \; -I(Z^m; Y) + \beta_m\, I(Z^m; X^m),$$
with the standard tractable surrogate
$$\mathcal{L}_{\mathrm{VIB}}^{m} = \mathbb{E}\!\left[-\log q(Y \mid Z^m)\right] + \beta_m\, \mathrm{KL}\!\left(p(Z^m \mid X^m) \,\|\, r(Z^m)\right),$$
where $q$ is a variational task predictor and $r(Z^m)$ is a centered isotropic Gaussian prior.
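A minimal PyTorch sketch of one such uni-modal VIB term is shown below, assuming a diagonal-Gaussian posterior with a standard-normal prior and a classification task head; the layer shapes and the value of `beta` are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UniModalVIB(nn.Module):
    """Uni-modal VIB encoder: maps extracted features X^m to a stochastic latent Z^m."""
    def __init__(self, feat_dim: int, latent_dim: int):
        super().__init__()
        self.mu = nn.Linear(feat_dim, latent_dim)      # mean of p(Z^m | X^m)
        self.logvar = nn.Linear(feat_dim, latent_dim)  # log-variance of p(Z^m | X^m)

    def forward(self, x):
        mu, logvar = self.mu(x), self.logvar(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        # Closed-form KL(p(Z^m | X^m) || N(0, I)), averaged over the batch
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
        return z, kl

def uni_vib_loss(task_logits, labels, kl, beta=1e-3):
    """Tractable VIB surrogate: task cross-entropy plus a beta-weighted KL regularizer."""
    return F.cross_entropy(task_logits, labels) + beta * kl
```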
2.2 Cross-modal Redundancy Suppression
The redundancy loss is a sum of pairwise MI terms, each lower-bounded in Jensen–Shannon form and estimated via an adversarial discriminator:
$$\mathcal{L}_{\mathrm{red}} = \sum_{(m,n) \in \{(i,t),\,(i,a),\,(t,a)\}} \hat{I}_{\mathrm{JSD}}(Z^m; Z^n),$$
where each estimate $\hat{I}_{\mathrm{JSD}}(Z^m; Z^n)$ is produced by the discriminator $T_{mn}$, trained to distinguish joint samples $(Z^m, Z^n)$ from shuffled (product-of-marginals) pairs.
Adversarial training alternates maximizing the MI estimates (discriminators) and minimizing them (encoders), enforced via the GRL.
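The sketch below illustrates one way the GRL and a single pairwise discriminator could be implemented, using a BCE-trained discriminator that separates aligned (joint) latent pairs from shuffled (marginal) pairs as the Jensen–Shannon MI surrogate; the hidden size and GRL coefficient are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Gradient-reversal layer: identity in the forward pass,
    gradients scaled by -lam in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class PairDiscriminator(nn.Module):
    """Scores whether a (z_m, z_n) pair is aligned (joint) or shuffled (marginal)."""
    def __init__(self, dim_m, dim_n, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_m + dim_n, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, z_m, z_n):
        return self.net(torch.cat([z_m, z_n], dim=-1)).squeeze(-1)

def redundancy_term(disc, z_m, z_n, grl_lambda=1.0):
    """Discriminator BCE for one modality pair. With the GRL applied to the latents,
    minimizing this loss trains the discriminator while the reversed gradients push
    the encoders to reduce cross-modal dependence."""
    z_m = GradReverse.apply(z_m, grl_lambda)
    z_n = GradReverse.apply(z_n, grl_lambda)
    joint = disc(z_m, z_n)                                    # aligned pairs
    perm = torch.randperm(z_n.size(0), device=z_n.device)
    marginal = disc(z_m, z_n[perm])                           # shuffled pairs
    # Averaged so a random-guess discriminator scores ln 2 ≈ 0.693.
    return 0.5 * (F.binary_cross_entropy_with_logits(joint, torch.ones_like(joint))
                  + F.binary_cross_entropy_with_logits(marginal, torch.zeros_like(marginal)))
```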
2.3 Multi-modal Joint Compression
After fusion, the concatenated latents $X = [Z^i; Z^t; Z^a]$ are compressed by the M-VIB encoder $\eta$ into the transmitted representation $Z$,
with loss
$$\mathcal{L}_{\mathrm{M\text{-}VIB}} = \mathbb{E}\!\left[-\log q(Y \mid \hat{Z})\right] + \beta\, \mathrm{KL}\!\left(p(Z \mid X) \,\|\, r(Z)\right),$$
where $\hat{Z}$ is the channel-corrupted latent recovered at the receiver and decoded by $\upsilon$ into $\hat{Y}$.
2.4 Overall Loss
The total training objective combines the three components with trade-off weights:
$$\mathcal{L} = \mathcal{L}_{\mathrm{M\text{-}VIB}} + \sum_{m \in \{i,t,a\}} \lambda_m\, \mathcal{L}_{\mathrm{VIB}}^{m} + \lambda_{\mathrm{red}}\, \mathcal{L}_{\mathrm{red}}.$$
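As a small illustration, the three terms might be combined as follows; the weights shown here are placeholders rather than the paper's settings.

```python
def total_loss(mvib_loss, uni_vib_losses, red_losses, lam_uni=1.0, lam_red=0.1):
    """Weighted sum of the M-VIB term, the per-modality VIB terms, and the
    pairwise redundancy terms (illustrative weights)."""
    return mvib_loss + lam_uni * sum(uni_vib_losses) + lam_red * sum(red_losses)
```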
3. Robustness to Channel Conditions and End-to-End Training
The system is trained with an explicit channel model, $\hat{Z} = hZ + n$, where $n$ is additive white Gaussian noise and $h = 1$ for the AWGN channel or a Rayleigh-fading gain otherwise. Robustness is imparted by end-to-end training over randomized channel states (SNRs drawn across a range of dB values) and by the KL regularization in $\mathcal{L}_{\mathrm{M\text{-}VIB}}$, which encourages the fused representation to approximate a centered isotropic Gaussian, conferring invariance to channel-induced perturbations.
Algorithmic details include batch-based training (batch size 32, 50 epochs), scheduler-based activation of the redundancy loss $\mathcal{L}_{\mathrm{red}}$, linear ramp-up of the GRL factor $\lambda$, and an empirically optimal fused dimension chosen per the ablation study.
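A sketch of how the channel layer and randomized training SNR could be simulated is given below; the SNR range, batch size, and latent dimension are illustrative assumptions, not values from the paper.

```python
import torch

def apply_channel(z, snr_db, fading="awgn"):
    """Corrupt the transmitted latent z at a given SNR (dB).
    AWGN: z_hat = z + n.  Rayleigh: z_hat = h * z + n with a per-sample
    Rayleigh-distributed magnitude gain h of unit average power."""
    sig_power = z.pow(2).mean(dim=-1, keepdim=True)
    noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    noise = torch.randn_like(z) * noise_power.sqrt()
    if fading == "rayleigh":
        h = torch.sqrt(torch.randn_like(sig_power) ** 2 +
                       torch.randn_like(sig_power) ** 2) / (2.0 ** 0.5)
        return h * z + noise
    return z + noise

# During training, a random SNR is drawn per batch so the encoders/decoder
# see a spread of channel states (the range below is illustrative).
z = torch.randn(32, 64)                               # a batch of fused latents
snr_db = float(torch.empty(1).uniform_(-12.0, 12.0))  # illustrative SNR range
z_hat = apply_channel(z, snr_db, fading="rayleigh")
```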
4. Performance Evaluation and Empirical Results
The framework was validated on the CMU-MOSI and CMU-MOSEI datasets for multi-modal emotion recognition under AWGN and Rayleigh channels, using Top-2 and Top-7 accuracy, F1-score, and mean absolute error (MAE) as evaluation metrics:
| Dataset/Channel | SNR (dB) | Model | F1-score (%) | Top-2 Acc (%) | Top-7 Acc (%) |
|---|---|---|---|---|---|
| MOSEI/AWGN | –6 | RMTOC | 79.38 | -- | 44.09 |
| MOSEI/AWGN | –6 | T-DeepSC* | 65.98 | -- | 43.10 |
| MOSI/Rayleigh | –12 | RMTOC | 73.71 | 73.59 | -- |
| MOSI/Rayleigh | –12 | T-DeepSC* | 63.35 | 63.27 | -- |
*Baseline
The GRL-based MI minimization drives the discriminators to random-guess binary cross-entropy (BCE) performance ($\approx \ln 2 \approx 0.693$, i.e., chance level), indicating effective cross-modal de-correlation. Increasing the transmission vector dimensionality beyond the selected fused dimension yields diminishing task gains, supporting a minimal, non-redundant representation.
5. Comparative Context: Related Methodologies
Redundancy-aware and information bottleneck-based approaches are distinctive features of this framework. Alternative strategies include:
- Importance-aware hierarchical coding: Dynamically weights encoding resources across segments, tokens, and bits based on learned task significance, enabling task-specific rate-distortion targeting (Ma et al., 22 Feb 2025).
- Attention-driven and semantic fusion models: Such as those utilizing large multimodal models for query-adaptive patch weighting and selective transmission (e.g., LLaVA-based vehicle assistants) (Du et al., 5 May 2025).
- Distributed and multi-agent models: Frameworks incorporating distributed bottleneck selection and probabilistic mode selection to navigate physical and compute limits, extending classical DIB theory to task-coordinate multi-agent setups (Zhou et al., 5 Oct 2025).
- Fusion modules (e.g., BERT fusion, Multi-GAT): Task-driven, self-attention-based multimodal fusion with explicit segment or token-level annotation for improved task multiplexing efficiency (Zhu et al., 1 Jul 2024, Guo et al., 18 Jan 2024).
6. Implications and Significance
The two-stage VIB plus adversarial MI minimization framework realizes a tight integration of per-modality compression, redundancy suppression, joint fusion, and noise-robust semantic representation. By optimizing for end-to-end task accuracy rather than channel-level fidelity alone and ensuring only complementary cross-modal information is encoded, it yields SOTA results under real-world channel conditions, with notable gains (13–15% F1 improvement at low SNR) over conventional and prior task-oriented semantic transceivers (Fu et al., 10 Nov 2025).
This principled approach facilitates reliable, bandwidth-efficient communication in multi-modal, resource-constrained, and dynamically adverse wireless environments, providing a robust foundation for semantic tasks that demand real-time, high-accuracy performance.