Multi-Modal Task-Oriented Communication

Updated 17 November 2025
  • The paper presents a multi-stage framework using per-modality variational bottlenecks and adversarial MI minimization to extract and fuse task-relevant features.
  • It employs deep semantic encoding and channel-aware strategies to compress inputs and ensure robust performance under noisy and resource-constrained conditions.
  • Empirical evaluations on datasets like CMU-MOSI/MOSEI show notable F1-score improvements and validate its effectiveness against traditional semantic communication models.

A multi-modal task-oriented communication framework is a system that jointly processes heterogeneous input modalities (such as text, images, video, and audio) for the purpose of transmitting only the information relevant to a specified downstream task (such as emotion recognition, question answering, or control), while operating over resource-constrained and noisy communication channels. The primary objective is to maximize task-relevant accuracy and reliability under stringent rate–distortion–robustness constraints, typically by integrating deep semantic encoding, redundancy-aware representation learning, rigorous cross-modal fusion, and channel-model-aware robustness strategies.

1. Core System Architecture and Redundancy-Aware Fusion

A robust multi-modal task-oriented communication framework operates in a multi-stage fashion. Each modality $m \in \{\mathrm{i}\ (\text{image}),\ \mathrm{t}\ (\text{text}),\ \mathrm{a}\ (\text{audio})\}$ undergoes pre-trained feature extraction $\delta^m$, followed by a uni-modal variational information bottleneck (VIB) encoder $\epsilon^m$ that produces a stochastic latent representation $Z^m$ preserving task-relevant semantics while compressing the input features $X^m \in \mathbb{R}^{d_m}$. The set $\{Z^i, Z^t, Z^a\}$ is concatenated and passed through a redundancy-minimizing fusion module.

The redundancy suppression is implemented by adversarially minimizing pairwise mutual information (MI) between each modality pair (e.g., $I(Z^i;Z^t)$). This is accomplished with gradient-reversal layers (GRL) and three discriminators (one per modality pair). After fusion, a multi-modal VIB (M-VIB) encoder further compresses the fused feature $X$ into $Z$, which is modulated, transmitted over a modeled channel with additive white Gaussian noise (AWGN), and then decoded into the final task prediction $\hat Y$ (Fu et al., 10 Nov 2025):

```
S^i → δ^i → X^i → ε^i → Z^i ┐
S^t → δ^t → X^t → ε^t → Z^t ┤→ [concat] → GRL → {T_it, T_ia, T_ta}
S^a → δ^a → X^a → ε^a → Z^a ┘
                                  ↓
                                  X → η → Z → channel → Ẑ → υ → Ŷ
```

This model architecture explicitly factors in both intra-modal compression (via uni-modal VIB) and inter-modal redundancy minimization, so that only complementary and non-redundant information is retained in the final transmitted representation.
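
As a concrete illustration of this wiring, the following PyTorch skeleton traces the data flow from per-modality features to the decoded prediction. It is a minimal sketch, not the authors' implementation: the class and submodule names, the input feature sizes, and the 7-class task head are assumptions, with plain linear layers standing in for the VIB encoders detailed in Section 2; only the fused dimension $d_Z = 50$ follows the ablation noted in Section 3.

```python
import torch
import torch.nn as nn

class PipelineSketch(nn.Module):
    """Wiring-only sketch: uni-modal encoders -> concat -> M-VIB -> channel -> decoder.
    All submodules are plain linear stand-ins; feature sizes are assumptions, and the
    per-modality latent size reuses d_z = 50 for simplicity."""
    def __init__(self, d_i=512, d_t=768, d_a=128, d_z=50, n_classes=7):
        super().__init__()
        self.enc_i = nn.Linear(d_i, d_z)           # stands in for the VIB encoder ε^i
        self.enc_t = nn.Linear(d_t, d_z)           # stands in for ε^t
        self.enc_a = nn.Linear(d_a, d_z)           # stands in for ε^a
        self.m_vib = nn.Linear(3 * d_z, d_z)       # stands in for the M-VIB encoder η
        self.decoder = nn.Linear(d_z, n_classes)   # stands in for the task decoder υ

    def forward(self, x_i, x_t, x_a, channel=lambda z: z):
        z_i, z_t, z_a = self.enc_i(x_i), self.enc_t(x_t), self.enc_a(x_a)
        x_fused = torch.cat([z_i, z_t, z_a], dim=1)   # GRL + pair discriminators act on these latents
        z = self.m_vib(x_fused)                       # compressed vector Z that is transmitted
        return self.decoder(channel(z))               # prediction Ŷ from the channel output Ẑ

# Example: a batch of 4 samples through a noiseless placeholder channel.
model = PipelineSketch()
y_hat = model(torch.randn(4, 512), torch.randn(4, 768), torch.randn(4, 128))
print(y_hat.shape)  # torch.Size([4, 7])
```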

2. Mathematical Formulation: Variational Bottleneck and Mutual Information Minimization

2.1 Uni-modal Compression

For each modality $m$, the VIB objective is
$$\min_{p(z^m|x^m)} I(X^m;Z^m) - \beta_m I(Z^m;Y),$$
with the tractable implementation
$$L_{\text{U-VIB}}^m = \mathbb{E}_{p(x^m)}\!\left[D_{KL}\big(p(z^m|x^m)\,\|\,q(z^m)\big)\right] - \beta_m\,\mathbb{E}_{p(x^m,y)}\!\left[\mathbb{E}_{p(z^m|x^m)}[\log q(y|z^m)]\right].$$
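
A minimal PyTorch sketch of one uni-modal VIB encoder and this loss is shown below, assuming a diagonal-Gaussian posterior with the reparameterization trick and a softmax classification head for $q(y|z^m)$; the module and function names, hidden width, and latent size are illustrative choices, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBEncoder(nn.Module):
    """Maps extracted features x^m to a stochastic latent z^m ~ N(mu, diag(exp(logvar)))."""
    def __init__(self, in_dim, z_dim, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return z, mu, logvar

def vib_loss(logits, y, mu, logvar, beta):
    """L_U-VIB^m = KL(p(z|x) || N(0, I)) - beta * E[log q(y|z)] for a softmax task head."""
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=1).mean()
    log_likelihood = -F.cross_entropy(logits, y)   # E[log q(y|z)]
    return kl - beta * log_likelihood

# Example: one modality with 768-dim features, a 50-dim latent, and a 7-class task head.
enc, head = VIBEncoder(768, 50), nn.Linear(50, 7)
x, y = torch.randn(8, 768), torch.randint(0, 7, (8,))
z, mu, logvar = enc(x)
loss = vib_loss(head(z), y, mu, logvar, beta=1e-2)
```

The closed-form KL term above assumes the variational prior $q(z^m)$ is a standard normal, which is the usual choice for VIB-style encoders.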

2.2 Cross-modal Redundancy Suppression

The redundancy loss is a sum of pairwise MI terms, each lower-bounded via a Jensen–Shannon divergence estimate obtained from adversarial discriminators:
$$L_{\mathrm{red}} = \sum_{(p,q)} J_{pq}(Z^p;Z^q),$$
where
$$J_{pq}(Z^p;Z^q) = \sup_{T_{pq}} \ \mathbb{E}_{p_{Z^p Z^q}}\!\left[\log \sigma(T_{pq}(Z^p,Z^q))\right] + \mathbb{E}_{p_{Z^p}p_{Z^q}}\!\left[\log \big(1-\sigma(T_{pq}(Z^p,Z^q))\big)\right] + 2\log 2.$$

Adversarial training alternates maximizing $J_{pq}$ (discriminators) and minimizing $J_{pq}$ (encoders), enforced via the GRL.
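
A sketch of how the GRL and one pairwise discriminator can be implemented follows; the class names, discriminator width, and the shuffled-batch approximation of the product of marginals are illustrative assumptions. Because the GRL flips the gradient, a single backward pass lets the discriminator tighten the bound while the encoders are pushed to reduce $I(Z^p;Z^q)$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -alpha in the backward pass."""
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None

class PairDiscriminator(nn.Module):
    """T_pq: scores whether (z_p, z_q) is drawn from the joint or the product of marginals."""
    def __init__(self, z_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * z_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, z_p, z_q):
        return self.net(torch.cat([z_p, z_q], dim=1)).squeeze(1)

def pairwise_redundancy_loss(disc, z_p, z_q, alpha):
    """BCE form of the J_pq estimator: the discriminator minimizes this loss (tightening
    the MI bound), while the GRL reverses gradients into the encoders so they are driven
    to reduce the estimated I(Z^p; Z^q)."""
    z_p, z_q = GradReverse.apply(z_p, alpha), GradReverse.apply(z_q, alpha)
    joint = disc(z_p, z_q)                                   # samples from p(Z^p, Z^q)
    shuffled = disc(z_p, z_q[torch.randperm(z_q.size(0))])   # approximates p(Z^p) p(Z^q)
    return (F.binary_cross_entropy_with_logits(joint, torch.ones_like(joint))
            + F.binary_cross_entropy_with_logits(shuffled, torch.zeros_like(shuffled)))
```

When the latents become pairwise independent, the discriminator cannot beat chance and this pairwise loss settles near $2\ln 2$, which is the diagnostic value reported in Section 4.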

2.3 Multi-modal Joint Compression

After fusion,

$$\min_{p(\hat z|x)} I(X;\hat Z) - \gamma\, I(\hat Z;Y)$$

with loss

$$L_{M\text{-}VIB} = \mathbb{E}_{p(x)}\!\left[D_{KL}\big(p(\hat z|x)\,\|\,q(\hat z)\big)\right] - \gamma\,\mathbb{E}_{p(x,y)}\!\left[\mathbb{E}_{p(\hat z|x)}[\log q(y|\hat z)]\right]$$
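
Since $L_{M\text{-}VIB}$ has the same form as the uni-modal loss, the illustrative `VIBEncoder`/`vib_loss` sketch from Section 2.1 can simply be reapplied to the fused feature; the dimensions, the linear task head, and the weight value below are placeholder assumptions.

```python
import torch
import torch.nn as nn

# Reuses VIBEncoder and vib_loss from the Section 2.1 sketch (illustrative names).
z_i, z_t, z_a = (torch.randn(8, 50) for _ in range(3))   # stand-ins for the uni-modal latents
m_vib = VIBEncoder(in_dim=3 * 50, z_dim=50)              # compresses the fused feature X into Z
task_head = nn.Linear(50, 7)                             # q(y | z) for an assumed 7-class task
y = torch.randint(0, 7, (8,))

z_hat, mu, logvar = m_vib(torch.cat([z_i, z_t, z_a], dim=1))
loss_mvib = vib_loss(task_head(z_hat), y, mu, logvar, beta=1e-2)  # gamma plays the role of beta
```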

2.4 Overall Loss

The total training objective is
$$L_{\mathrm{total}} = \sum_{m \in \{i,t,a\}} L_{\text{U-VIB}}^m + \lambda_{\mathrm{red}}\, L_{\mathrm{red}} + L_{M\text{-}VIB}.$$
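
In code, assembling $L_{\mathrm{total}}$ is a direct sum of the terms sketched above; the helper below only fixes the bookkeeping, and the default $\lambda_{\mathrm{red}}$ value is an arbitrary placeholder (the paper activates this weight on a schedule, see Section 3).

```python
def total_loss(uvib_losses, pairwise_red_losses, mvib_loss, lambda_red=0.1):
    """L_total = sum_m L_U-VIB^m + lambda_red * L_red + L_M-VIB.
    `uvib_losses` holds the three uni-modal losses, `pairwise_red_losses` the three
    pairwise J_pq estimates (it, ia, ta); lambda_red=0.1 is an assumed placeholder."""
    return sum(uvib_losses) + lambda_red * sum(pairwise_red_losses) + mvib_loss
```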

3. Robustness to Channel Conditions and End-to-End Training

The system is trained with an explicit channel model:
$$\hat Z = h(Z) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I).$$
Robustness is imparted by end-to-end training over randomized channel states (SNRs in $[0, 21]$ dB) and by the KL regularization in $L_{M\text{-}VIB}$, which encourages the fused representation to approximate a centered isotropic Gaussian, conferring invariance to channel-induced perturbations.
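
The channel can be realized as a differentiable layer that injects noise at a randomly drawn SNR on each forward pass, as sketched below; the unit-power normalization convention and the per-batch (rather than per-sample) SNR draw are assumptions.

```python
import torch
import torch.nn as nn

class AWGNChannel(nn.Module):
    """Adds white Gaussian noise at an SNR drawn uniformly from [0, 21] dB,
    matching the randomized channel states used for end-to-end training."""
    def __init__(self, snr_db_range=(0.0, 21.0)):
        super().__init__()
        self.snr_db_range = snr_db_range

    def forward(self, z):
        snr_db = torch.empty(1, device=z.device).uniform_(*self.snr_db_range)
        signal_power = z.pow(2).mean()
        noise_power = signal_power / 10.0 ** (snr_db / 10.0)   # SNR = P_signal / P_noise
        return z + noise_power.sqrt() * torch.randn_like(z)
```

Because the noise injection is additive, the layer sits between the M-VIB encoder and the decoder (as in the pipeline of Section 1) and gradients from the task loss propagate straight through it.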

Algorithmic details include batch-based training (batch size 32, 50 epochs), scheduler-based activation of $\lambda_{\mathrm{red}}$, a linear ramp-up of the GRL factor $\alpha$, and an empirically optimal fused dimension $d_Z = 50$, per the ablation study.
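
The two schedules mentioned above can be expressed as small helpers; the linear ramp of the GRL factor follows the description in the text, while the warm-up length and the target weight below are assumed values.

```python
def grl_alpha(step, total_steps):
    """Linear ramp of the gradient-reversal factor alpha from 0 to 1 over training."""
    return min(1.0, step / float(total_steps))

def lambda_red_weight(epoch, warmup_epochs=10, target=0.1):
    """Scheduler-based activation of lambda_red: keep the redundancy term off during
    warm-up, then switch it on (warmup_epochs and target are assumed values)."""
    return target if epoch >= warmup_epochs else 0.0
```

A typical usage would scale the pairwise redundancy losses by `lambda_red_weight(epoch)` and pass `grl_alpha(step, total_steps)` into the GRL calls sketched in Section 2.2.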

4. Performance Evaluation and Empirical Results

The framework was validated on the CMU-MOSI and CMU-MOSEI datasets for multi-modal emotion recognition under AWGN and Rayleigh channels, using Top-2, Top-7 accuracy, F1-score, and mean absolute error (MAE) as evaluation metrics:

| Dataset / Channel | SNR (dB) | Model | F1-score (%) | Top-2 Acc (%) | Top-7 Acc (%) |
|---|---|---|---|---|---|
| MOSEI / AWGN | –6 | RMTOC | 79.38 | -- | 44.09 |
| MOSEI / AWGN | –6 | T-DeepSC* | 65.98 | -- | 43.10 |
| MOSI / Rayleigh | –12 | RMTOC | 73.71 | 73.59 | -- |
| MOSI / Rayleigh | –12 | T-DeepSC* | 63.35 | 63.27 | -- |

*Baseline

The GRL-based MI minimization drives the discriminators' binary cross-entropy (BCE) to random-guess level ($\approx 2\ln 2$), indicating effective cross-modal de-correlation. Increasing the transmission vector dimensionality beyond $d_Z \approx 50$ yields diminishing task gains, supporting a minimal, non-redundant representation.

5. Comparison with Alternative Strategies

Redundancy-aware, information bottleneck-based compression is the distinctive feature of this framework. Alternative strategies include:

  • Importance-aware hierarchical coding: Dynamically weights encoding resources across segments, tokens, and bits based on learned task significance, enabling task-specific rate-distortion targeting (Ma et al., 22 Feb 2025).
  • Attention-driven and semantic fusion models: Such as those utilizing large multimodal models for query-adaptive patch weighting and selective transmission (e.g., LLaVA-based vehicle assistants) (Du et al., 5 May 2025).
  • Distributed and multi-agent models: Frameworks incorporating distributed bottleneck selection and probabilistic mode selection to navigate physical and compute limits, extending classical DIB theory to task-coordinated multi-agent setups (Zhou et al., 5 Oct 2025).
  • Fusion modules (e.g., BERT fusion, Multi-GAT): Task-driven, self-attention-based multimodal fusion with explicit segment or token-level annotation for improved task multiplexing efficiency (Zhu et al., 1 Jul 2024, Guo et al., 18 Jan 2024).

6. Implications and Significance

The two-stage VIB plus adversarial MI minimization framework realizes a tight integration of per-modality compression, redundancy suppression, joint fusion, and noise-robust semantic representation. By optimizing for end-to-end task accuracy rather than channel-level fidelity alone, and by ensuring that only complementary cross-modal information is encoded, it achieves state-of-the-art results under realistic channel models (AWGN and Rayleigh), with notable gains (13–15% F1 improvement at low SNR) over conventional and prior task-oriented semantic transceivers (Fu et al., 10 Nov 2025).

This principled approach facilitates reliable, bandwidth-efficient communication in multi-modal, resource-constrained, and dynamically adverse wireless environments, providing a robust foundation for semantic tasks that demand real-time, high-accuracy performance.
