MFMC: Multimodal Functional Maximum Correlation

Updated 4 January 2026
  • The paper introduces MFMC, a novel self-supervised framework that maximizes higher-order statistical dependence among coordinated physiological modalities.
  • It leverages a trace-based surrogate for functional maximum correlation analysis, ensuring stable and efficient optimization without traditional pairwise contrastive methods.
  • Unified encoder architectures and benchmark results on DEAP, CEAP-360VR, and MAHNOB-HCI datasets validate MFMC’s superior performance in both subject-dependent and subject-independent emotion recognition.

Multimodal Functional Maximum Correlation (MFMC) is a self-supervised learning (SSL) framework for multimodal time-series representation learning, designed to directly maximize higher-order statistical dependence among M coordinated physiological modalities. In the context of affective computing, MFMC leverages the Dual Total Correlation (DTC) to capture both unique and synergistic information inherent in the joint dynamics of modalities such as EEG, ECG, EDA, and skin temperature, without relying on pairwise contrastive objectives or negative sample construction. MFMC achieves this through a trace-based surrogate for functional maximum correlation analysis (FMCA), enabling stable, efficient optimization of joint mutual information terms and yielding state-of-the-art results in both subject-dependent and subject-independent emotion recognition tasks (Zheng et al., 28 Dec 2025).

1. Foundation: Multimodal Dependence and Information-Theoretic Objectives

Emotion recognition from physiological signals necessitates the modeling of heterogeneous but interdependent responses distributed across central and autonomic systems. Each modality provides temporally resolved, complementary information, but labeled affective annotations are scarce and subjective, motivating SSL approaches. Traditional objectives—e.g., InfoNCE, Barlow Twins—typically align modality pairs and therefore fail to characterize higher-order, synchronous, and asynchronous joint dependencies elicited by emotional states.

MFMC is predicated on maximizing DTC, defined as

$$\mathrm{DTC}(X_1, \dots, X_M) = H(X_1, \dots, X_M) - \sum_{i=1}^{M} H\!\left(X_i \mid X_{[M] \setminus i}\right),$$

where X_i denotes the i-th modality and H(·) is the entropy. Unlike Total Correlation (TC), which measures redundancy,

$$\mathrm{TC}(X_{[M]}) = \sum_{i=1}^{M} H(X_i) - H(X_1, \dots, X_M),$$

DTC quantifies synergistic and unique contributions, crucial for capturing coordinated physiological responses.

MFMC seeks to maximize joint mutual information via DTC, sidestepping the over-counting endemic to TC and the pairwise limitations of existing SSL alignments.
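
To make the distinction concrete, here is a minimal numerical sketch (an illustration, not from the paper) on a purely synergistic triple with X3 = X1 XOR X2, where every pairwise dependence vanishes but DTC does not:

```python
import numpy as np

def H(p):
    """Shannon entropy in bits of a (possibly multidimensional) pmf."""
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Joint pmf over three binary variables with X1, X2 independent uniform
# and X3 = X1 XOR X2: every pairwise MI I(Xi; Xj) is exactly zero, so the
# dependence is invisible to pairwise objectives but captured by DTC.
p = np.zeros((2, 2, 2))
for x1 in range(2):
    for x2 in range(2):
        p[x1, x2, x1 ^ x2] = 0.25

M = 3
H_joint = H(p)
H_single = [H(p.sum(axis=tuple(j for j in range(M) if j != i))) for i in range(M)]
H_rest = [H(p.sum(axis=i)) for i in range(M)]  # entropy of all variables but X_i

TC = sum(H_single) - H_joint             # redundancy: 1.0 bit
DTC = sum(H_rest) - (M - 1) * H_joint    # unique + synergistic: 2.0 bits

print(f"TC  = {TC:.3f} bits, DTC = {DTC:.3f} bits")
```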

2. Tight Sandwich Bound and FMCA Surrogate

Estimating DTC in high-dimensional, continuous settings is computationally infeasible. For the case M = 3, MFMC leverages a tight information-theoretic sandwich bound:

$$\frac{1}{3} \sum_{\text{cyc}} I(\text{pair}; \text{third}) \;\leq\; \mathrm{DTC}(X_1, X_2, X_3) \;\leq\; \frac{2}{3} \sum_{\text{cyc}} I(\text{pair}; \text{third}),$$

where

$$\sum_{\text{cyc}} I(\text{pair}; \text{third}) = I(X_1, X_2; X_3) + I(X_2, X_3; X_1) + I(X_3, X_1; X_2).$$

Consequently, maximizing the sum of tri-modal mutual informations approximates DTC up to a scaling constant.
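
The bound admits a short verification (a derivation sketch for completeness, not reproduced from the paper). Expanding each cyclic term via the chain rule, I(X_i, X_j; X_k) = H(X_i, X_j) + H(X_k) − H(X_1, X_2, X_3), and summing the three terms gives

$$\sum_{\text{cyc}} I(\text{pair}; \text{third}) = \mathrm{TC}(X_{[3]}) + \mathrm{DTC}(X_1, X_2, X_3).$$

Combining this identity with the standard relation TC/2 ≤ DTC ≤ 2·TC for three variables yields exactly the factor-of-two sandwich above; the synergistic XOR triple in Section 1 attains the upper endpoint.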

FMCA provides a functional decomposition and trace-based surrogate for the required joint mutual information terms, implemented as

$$\mathcal{L}_{\mathrm{FMCA}}(X, Y) = -\mathrm{tr}\!\left(R_X^{-1} P_{XY} R_Y^{-1} P_{XY}^{\top}\right),$$

where R_X and R_Y are feature covariance matrices and P_XY is the cross-covariance. For the multimodal setting,

$$\mathcal{L}_{\mathrm{MFMC}} = -\sum_{\text{cyc}} \mathrm{tr}\!\left(R_{ij}^{-1} P_{ij,k} R_k^{-1} P_{ij,k}^{\top}\right),$$

where R_ij and P_ij,k are computed from joint embeddings produced by learned fusion networks.

The trace surrogate avoids the numerical instability of log-determinant objectives and the computational cost of eigenvalue decompositions, ensuring robust and efficient optimization.
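
A minimal PyTorch sketch of the trace surrogate, consistent with the formulas above (an illustration, not the authors' released code; function names are ours, and the ε-jitter follows the regularization noted in Section 5):

```python
import torch

def fmca_trace_loss(zx: torch.Tensor, zy: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Trace-based FMCA surrogate: -tr(R_X^{-1} P_XY R_Y^{-1} P_XY^T).

    zx, zy: (batch, dim) projection-head embeddings of two views/modalities.
    eps:    jitter added to covariance diagonals before inversion.
    """
    n = zx.shape[0]
    zx = zx - zx.mean(dim=0)                       # center the features
    zy = zy - zy.mean(dim=0)
    Rx = zx.T @ zx / n + eps * torch.eye(zx.shape[1])
    Ry = zy.T @ zy / n + eps * torch.eye(zy.shape[1])
    Pxy = zx.T @ zy / n                            # cross-covariance
    # solve() applies the inverses without forming them explicitly,
    # sidestepping the instability of a log-det objective (cf. Section 4).
    return -torch.trace(torch.linalg.solve(Rx, Pxy) @ torch.linalg.solve(Ry, Pxy.T))

def mfmc_loss(e12, e23, e31, e1, e2, e3):
    """Cyclic tri-modal objective: each fused pair embedding e_ij is
    scored against the held-out modality embedding e_k."""
    return (fmca_trace_loss(e12, e3)
            + fmca_trace_loss(e23, e1)
            + fmca_trace_loss(e31, e2))
```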

3. Model Architecture and Implementation

MFMC uses unified backbone encoders per modality to preclude architecture-induced confounds and facilitate fair comparisons across SSL objectives.

  • Temporal CNNs: Four 1-D depth-wise convolutional blocks per channel (kernel size 11), each followed by BatchNorm, ReLU, and stride-4 max-pooling, yielding 256-fold downsampling for 10 s windows at 128 Hz.
  • Channel Fusion MLPs: For multichannel modalities, per-channel features are concatenated and passed through a (C × F) → 4000 → 128 MLP. Univariate modalities (e.g., SKT) are encoded directly.
  • Joint Modality Fusion: Paired embeddings [e_i, e_j] ∈ R^256 are processed by a 256 → 128 MLP to yield e_ij.
  • Projection Heads: Three dedicated 2-layer MLPs (128 → 512 → 128) compute the embeddings entering the FMCA losses.
  • Downstream Classifier: Encoder outputs feed a 4-class MLP for valence–arousal quadrant classification, with the encoder frozen during evaluation.

The training process comprises SSL pretraining (for encoder and fusion networks) via the trace-based MFMC loss, followed by linear evaluation using cross-entropy loss.
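
For concreteness, a minimal PyTorch sketch of the per-modality backbone described above (dimensions follow the list; padding and other unstated details are assumptions):

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Per-modality backbone: four depth-wise 1-D conv blocks (kernel 11),
    each followed by BatchNorm, ReLU, and stride-4 max-pooling, giving
    4**4 = 256-fold temporal downsampling, then a channel-fusion MLP
    (C * F -> 4000 -> 128). Padding choices here are assumptions."""

    def __init__(self, channels: int, window_len: int = 1280):  # 10 s at 128 Hz
        super().__init__()
        blocks = []
        for _ in range(4):
            blocks += [
                nn.Conv1d(channels, channels, kernel_size=11,
                          padding=5, groups=channels),  # depth-wise per channel
                nn.BatchNorm1d(channels),
                nn.ReLU(),
                nn.MaxPool1d(kernel_size=4, stride=4),  # 4x downsample
            ]
        self.conv = nn.Sequential(*blocks)
        feat = window_len // 256                        # 5 steps for 1280 samples
        self.fuse = nn.Sequential(                      # channel-fusion MLP
            nn.Flatten(),
            nn.Linear(channels * feat, 4000), nn.ReLU(),
            nn.Linear(4000, 128),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, window_len) -> (batch, 128) embedding
        return self.fuse(self.conv(x))

# e.g., a 32-channel EEG window: TemporalEncoder(32)(torch.randn(8, 32, 1280))
```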

4. Benchmarking: Protocols, Results, and Ablations

MFMC is quantitatively evaluated on DEAP, CEAP-360VR, and MAHNOB-HCI datasets, spanning multimodal signals and rigorous subject-dependent and subject-independent protocols.

Dataset (modalities)          Protocol                        Metric     MFMC Performance
DEAP (EEG, EOG, SKT)          Subject-dep. / Subject-indep.   Accuracy   0.987 ± 0.009 / 0.346 ± 0.030
CEAP-360VR (EDA, BVP, SKT)    Subject-dep. / Subject-indep.   Accuracy   0.868 ± 0.003 / 0.331 ± 0.020
MAHNOB-HCI (EEG, ECG, EDA)    Subject-dep. / Subject-indep.   Accuracy   0.953 ± 0.003 / 0.442 ± 0.024

MFMC strictly outperforms all pairwise-based SSL baselines (CLIP, FMCA) and TC-based multi-way bounds (SymILE, CLIP++), and matches or exceeds supervised HyperFuseNet in subject-independent splits, despite using no labels during pretraining. Representative ablation results on DEAP (subject-dependent protocol) are:

Objective             Accuracy (%)
MFMC (trace)          98.7
High-order InfoNCE    97.6
FMCA (LogDet)         86.3

Applying InfoNCE to higher-order mutual information plateaus below MFMC, and the log-determinant FMCA objective is numerically unstable; with the trace surrogate, MFMC exhibits stable convergence and the highest accuracy.

5. Practical Considerations for Reproducibility

Effective replication of MFMC results requires attention to signal windowing, modality selection, optimizer settings, and covariance regularization; a consolidated configuration sketch follows the list:

  • Windowing: 10 s windows at 128 Hz with 40% overlap. Macro-F1 performance peaks at 10 s in parameter sweeps.
  • Modality Selection: Learnable attention masks facilitate optimal modality inclusion per dataset (DEAP: EEG, EOG, SKT; CEAP-360VR: EDA, BVP, SKT; MAHNOB-HCI: EEG, ECG, EDA).
  • Batch Size: 200 windows per SSL minibatch; the same or smaller batch sizes for downstream evaluation.
  • Optimization: Adam with learning rate 3 × 10⁻⁴ and (β₁, β₂) = (0.5, 0.9); no scheduler.
  • Covariance Regularization: Add ε = 10⁻⁴ to the diagonals of covariance matrices prior to inversion.
  • Data Augmentation: None used for MFMC. Baselines employ Gaussian noise, temporal shifts, and channel dropout.
  • Hardware: A single NVIDIA L40S (48 GB); ~5 hours for full 5-fold DEAP training and evaluation.
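
The settings above consolidate to something like the following (a sketch; variable names, the window-slicing helper, and the stand-in module are illustrative, not from the released code):

```python
import torch

# Settings as listed above; names and the placeholder module are illustrative.
WINDOW = 10 * 128                           # 10 s at 128 Hz -> 1280 samples
STRIDE = int(WINDOW * (1 - 0.40))           # 40% overlap -> 768-sample hop
BATCH_SIZE = 200                            # windows per SSL minibatch
COV_EPS = 1e-4                              # jitter added to covariance diagonals

def slice_windows(recording: torch.Tensor) -> torch.Tensor:
    """Cut a (channels, T) recording into (num_windows, channels, WINDOW) slices."""
    return recording.unfold(-1, WINDOW, STRIDE).permute(1, 0, 2)

model = torch.nn.Linear(WINDOW, 128)        # stand-in for encoder + fusion nets
optimizer = torch.optim.Adam(model.parameters(),
                             lr=3e-4, betas=(0.5, 0.9))  # Adam, no LR scheduler
```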

Official code, notebooks, and data scripts are public at https://github.com/DY9910/MFMC.

6. Extensions, Generalization, and Context

MFMC is not specific to emotion recognition but constitutes a general SSL framework for multimodal time-series analysis, suitable for tasks such as sleep staging, mental health biomarker inference, and BCI applications. Additional modalities (e.g., camera, audio, eye-tracking, respiratory belt) can be integrated by extending DTC sandwich bounds and encoder/fusion modules.

Post-hoc analysis of MFMC’s learned representations and covariance spectra allows quantification of synergy within modality combinations, providing practical insight into physiological dependence. MFMC features can be fine-tuned in semi-supervised or domain-adaptation settings to handle cross-device or cross-population domain shift.

This approach provides the first augmentation-light SSL methodology for capturing higher-order multimodal interactions via DTC-grounded, trace-based surrogates, yielding robust and high-fidelity representations for a range of temporal, physiological, and distributed sensing applications (Zheng et al., 28 Dec 2025).

References

Zheng et al. MFMC: Multimodal Functional Maximum Correlation. 28 Dec 2025.
