MFMC: Multimodal Functional Maximum Correlation
- The paper introduces MFMC, a novel self-supervised framework that maximizes higher-order statistical dependence among coordinated physiological modalities.
- It leverages a trace-based surrogate for functional maximum correlation analysis, enabling stable and efficient optimization without pairwise contrastive objectives or negative sampling.
- Unified encoder architectures and benchmark results on DEAP, CEAP-360VR, and MAHNOB-HCI datasets validate MFMC’s superior performance in both subject-dependent and subject-independent emotion recognition.
Multimodal Functional Maximum Correlation (MFMC) is a self-supervised learning (SSL) framework for multimodal time-series representation learning, designed to directly maximize higher-order statistical dependence among coordinated physiological modalities. In the context of affective computing, MFMC leverages the Dual Total Correlation (DTC) to capture both unique and synergistic information inherent in the joint dynamics of modalities such as EEG, ECG, EDA, and skin temperature, without relying on pairwise contrastive objectives or negative sample construction. MFMC achieves this through a trace-based surrogate for functional maximum correlation analysis (FMCA), enabling stable, efficient optimization of joint mutual information terms and yielding state-of-the-art results in both subject-dependent and subject-independent emotion recognition tasks (Zheng et al., 28 Dec 2025).
1. Foundation: Multimodal Dependence and Information-Theoretic Objectives
Emotion recognition from physiological signals necessitates the modeling of heterogeneous but interdependent responses distributed across central and autonomic systems. Each modality provides temporally resolved, complementary information, but labeled affective annotations are scarce and subjective, motivating SSL approaches. Traditional objectives—e.g., InfoNCE, Barlow Twins—typically align modality pairs and therefore fail to characterize higher-order, synchronous, and asynchronous joint dependencies elicited by emotional states.
MFMC is predicated on maximizing DTC, defined for modalities $X_1, \dots, X_M$ as

$$\mathrm{DTC}(X_1, \dots, X_M) = H(X_1, \dots, X_M) - \sum_{m=1}^{M} H(X_m \mid X_{\setminus m}),$$

where $X_m$ denotes the $m$th modality, $X_{\setminus m}$ denotes all modalities except $X_m$, and $H(\cdot)$ is the (joint or conditional) entropy. Unlike Total Correlation (TC), $\mathrm{TC} = \sum_{m} H(X_m) - H(X_1, \dots, X_M)$, which measures redundancy, DTC quantifies synergistic and unique contributions, crucial for capturing coordinated physiological responses. MFMC therefore maximizes joint dependence via DTC, sidestepping the over-counting of redundancy endemic to TC and the pairwise limitations of existing SSL alignments.
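The distinction matters for synergy: on the classic XOR triple, TC is small while DTC is large. A minimal sketch, assuming nothing beyond the definitions above, computes both quantities directly from the discrete joint distribution:

```python
# Minimal sketch (not from the paper): TC vs. DTC on a synergistic system,
# X3 = X1 XOR X2 with X1, X2 independent fair bits.
import itertools
import math

def entropy(p):
    """Shannon entropy (bits) of a probability table given as a dict."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def marginal(joint, keep):
    """Marginalize the joint table onto the index set `keep`."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# Joint distribution of (X1, X2, X3) with X3 = X1 ^ X2.
joint = {(a, b, a ^ b): 0.25 for a, b in itertools.product([0, 1], repeat=2)}

H_joint = entropy(joint)                                           # H(X1,X2,X3) = 2 bits
H_m = [entropy(marginal(joint, [i])) for i in range(3)]            # each marginal: 1 bit
H_rest = [entropy(marginal(joint, [j for j in range(3) if j != i])) for i in range(3)]

# TC = sum_i H(X_i) - H(joint);  DTC = H(joint) - sum_i H(X_i | X_rest),
# with H(X_i | X_rest) = H(joint) - H(X_rest).
tc = sum(H_m) - H_joint
dtc = H_joint - sum(H_joint - h for h in H_rest)
print(f"TC = {tc:.2f} bits, DTC = {dtc:.2f} bits")                 # TC = 1.00, DTC = 2.00
```

Here DTC (2 bits) exceeds TC (1 bit) precisely because the triple is purely synergistic; a redundancy-focused objective would undervalue this dependence.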
2. Tight Sandwich Bound and FMCA Surrogate
Estimating DTC directly in high-dimensional, continuous settings is computationally infeasible. MFMC instead leverages a tight information-theoretic sandwich bound for the tri-modal case $M = 3$:

$$\frac{1}{3} \sum_{m=1}^{3} I\!\left(X_m; X_{\setminus m}\right) \;\le\; \mathrm{DTC}(X_1, X_2, X_3) \;\le\; \frac{2}{3} \sum_{m=1}^{3} I\!\left(X_m; X_{\setminus m}\right).$$

Consequently, maximizing the sum of tri-modal mutual informations $I(X_m; X_{\setminus m})$ approximates maximizing DTC up to a scaling constant.
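One route to such a bound (a reconstruction from standard identities, not necessarily the paper's exact derivation) combines the identity $\mathrm{TC} + \mathrm{DTC} = \sum_{m} I(X_m; X_{\setminus m})$ with the known two-sided relation $\tfrac{1}{M-1}\,\mathrm{TC} \le \mathrm{DTC} \le (M-1)\,\mathrm{TC}$. Writing $S = \sum_{m=1}^{3} I(X_m; X_{\setminus m})$ for $M = 3$:

$$
\begin{aligned}
\mathrm{TC} \le 2\,\mathrm{DTC} \;&\Rightarrow\; S = \mathrm{TC} + \mathrm{DTC} \le 3\,\mathrm{DTC} \;\Rightarrow\; \mathrm{DTC} \ge \tfrac{1}{3}\,S, \\
\mathrm{DTC} \le 2\,\mathrm{TC} \;&\Rightarrow\; S = \mathrm{TC} + \mathrm{DTC} \ge \tfrac{3}{2}\,\mathrm{DTC} \;\Rightarrow\; \mathrm{DTC} \le \tfrac{2}{3}\,S.
\end{aligned}
$$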
FMCA provides a functional decomposition and trace-based surrogate for the required joint mutual information terms. For a pair of embedding batches with feature covariance matrices $C_X$, $C_Y$ and cross-covariance $C_{XY}$, the dependence measure is implemented as

$$\widehat{I}(X; Y) \;\propto\; \operatorname{tr}\!\left(C_X^{-1} C_{XY}\, C_Y^{-1} C_{XY}^{\top}\right).$$

For the multimodal setting, the objective sums this surrogate over all modality-versus-rest pairs:

$$\mathcal{L}_{\mathrm{MFMC}} = -\sum_{m=1}^{3} \operatorname{tr}\!\left(C_{Z_m}^{-1} C_{Z_m Z_{\setminus m}}\, C_{Z_{\setminus m}}^{-1} C_{Z_m Z_{\setminus m}}^{\top}\right),$$

where $Z_m$ is the projected embedding of modality $m$ and $Z_{\setminus m}$ is derived from joint embeddings of the remaining modalities via learned fusion networks.
The trace surrogate avoids the numerical instability of log-determinant objectives and the computational cost of eigenvalue decompositions, ensuring robust and efficient optimization.
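A minimal PyTorch sketch of such a trace objective is given below; `trace_dependence`, `mfmc_loss`, and the ridge `eps` are illustrative names and choices, not the authors' released API:

```python
import torch

def trace_dependence(z_x: torch.Tensor, z_y: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Trace-based FMCA-style dependence between two embedding batches.

    z_x: (N, d_x), z_y: (N, d_y). Returns tr(C_x^{-1} C_xy C_y^{-1} C_xy^T);
    larger values indicate stronger statistical dependence.
    """
    n = z_x.shape[0]
    z_x = z_x - z_x.mean(dim=0, keepdim=True)   # center features
    z_y = z_y - z_y.mean(dim=0, keepdim=True)
    # Ridge-regularized covariances (see the reproducibility notes below).
    c_x = z_x.T @ z_x / n + eps * torch.eye(z_x.shape[1], device=z_x.device)
    c_y = z_y.T @ z_y / n + eps * torch.eye(z_y.shape[1], device=z_y.device)
    c_xy = z_x.T @ z_y / n
    # linalg.solve avoids forming explicit inverses (better conditioned).
    m = torch.linalg.solve(c_x, c_xy) @ torch.linalg.solve(c_y, c_xy.T)
    return torch.trace(m)

def mfmc_loss(z_mods, z_fused):
    """Negative sum of tri-modal trace terms: each modality's projection
    against the fused embedding of the remaining modalities.

    z_mods:  list of per-modality projections [(N, d), ...]
    z_fused: list of fused 'all-but-m' projections, aligned with z_mods.
    """
    return -sum(trace_dependence(z_m, z_f) for z_m, z_f in zip(z_mods, z_fused))
```

Note that the gradient of the trace involves only matrix solves against well-conditioned (ridge-regularized) covariances, which is the source of the stability advantage over log-determinant objectives.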
3. Model Architecture and Implementation
MFMC uses a unified backbone encoder per modality to preclude architecture-induced confounds and facilitate fair comparisons across SSL objectives (see the sketch after this list).
- Temporal CNNs: Four 1D depth-wise convolutional blocks per channel, each followed by BatchNorm, ReLU, and stride-4 max-pooling, yielding 256-fold ($4^4$) temporal downsampling for 10 s windows at 128 Hz.
- Channel Fusion MLPs: For multichannel modalities, per-channel features are concatenated and passed through an MLP; univariate modalities (e.g., SKT) are encoded directly.
- Joint Modality Fusion: For each left-out modality $m$, the paired embeddings of the two remaining modalities are processed by an MLP to yield the joint embedding $Z_{\setminus m}$.
- Projection Heads: Three dedicated 2-layer MLPs compute the embeddings entering the FMCA trace losses.
- Downstream Classifier: Encoder outputs feed a 4-class MLP for valence–arousal quadrant classification, with frozen encoder during evaluation.
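To make the backbone concrete, here is a minimal PyTorch sketch consistent with the bullets above; `TemporalEncoder`, the kernel size of 7, and the 128-dim embedding are illustrative placeholders rather than the authors' exact configuration:

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Per-modality backbone: four 1D depth-wise conv blocks, each followed by
    BatchNorm, ReLU, and stride-4 max-pooling (4^4 = 256x downsampling),
    then a channel-fusion MLP. Kernel size and widths are placeholders."""

    def __init__(self, n_channels: int, kernel_size: int = 7, emb_dim: int = 128):
        super().__init__()
        blocks = []
        for _ in range(4):
            blocks += [
                nn.Conv1d(n_channels, n_channels, kernel_size,
                          padding=kernel_size // 2, groups=n_channels),  # depth-wise
                nn.BatchNorm1d(n_channels),
                nn.ReLU(),
                nn.MaxPool1d(kernel_size=4, stride=4),
            ]
        self.conv = nn.Sequential(*blocks)
        # 10 s at 128 Hz -> 1280 samples -> 1280 / 256 = 5 time steps per channel.
        self.fuse = nn.Sequential(nn.Flatten(),
                                  nn.Linear(n_channels * 5, emb_dim),
                                  nn.ReLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, 1280) -> (batch, emb_dim)
        return self.fuse(self.conv(x))
```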
The training process comprises SSL pretraining (for encoder and fusion networks) via the trace-based MFMC loss, followed by linear evaluation using cross-entropy loss.
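A hedged sketch of the linear-evaluation stage, reusing the `TemporalEncoder` sketch above (the 64-unit hidden layer is a hypothetical choice):

```python
import torch
import torch.nn as nn

def linear_evaluation(encoder: nn.Module, emb_dim: int, n_classes: int = 4):
    """Freeze the pretrained encoder; train only a small MLP head with
    cross-entropy for 4-class valence-arousal quadrant classification."""
    for p in encoder.parameters():
        p.requires_grad = False          # frozen backbone during evaluation
    encoder.eval()
    head = nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(), nn.Linear(64, n_classes))
    opt = torch.optim.Adam(head.parameters())
    return head, opt, nn.CrossEntropyLoss()
```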
4. Benchmarking: Protocols, Results, and Ablations
MFMC is quantitatively evaluated on DEAP, CEAP-360VR, and MAHNOB-HCI datasets, spanning multimodal signals and rigorous subject-dependent and subject-independent protocols.
| Dataset | Protocol | Metric | MFMC Performance |
|---|---|---|---|
| DEAP (EEG, EOG, SKT) | Subj.-dep. / Subj.-indep. | Accuracy | 0.987 ± 0.009 / 0.346 ± 0.030 |
| CEAP-360VR (EDA, BVP, SKT) | Subj.-dep. / Subj.-indep. | Accuracy | 0.868 ± 0.003 / 0.331 ± 0.020 |
| MAHNOB-HCI (EEG, ECG, EDA) | Subj.-dep. / Subj.-indep. | Accuracy | 0.953 ± 0.003 / 0.442 ± 0.024 |
MFMC strictly outperforms all pairwise-based SSL baselines (CLIP, FMCA) and TC-based multi-way bounds (SymILE, CLIP++), and matches or exceeds supervised HyperFuseNet in subject-independent splits, despite leveraging no labels during pretraining. Representative ablation results on DEAP (subject-dependent) are:
| Objective | Accuracy (%) |
|---|---|
| MFMC (trace) | 98.7 |
| High-order InfoNCE | 97.6 |
| FMCA (LogDet) | 86.3 |
Applying InfoNCE to higher-order MI plateaus below MFMC, and the log-determinant FMCA objective is numerically unstable; with the trace surrogate, MFMC converges stably to the highest accuracy.
5. Practical Considerations for Reproducibility
Effective replication of MFMC results requires attention to signal windowing, modality selection, optimizer settings, and covariance regularization:
- Windowing: 10 s windows at 128 Hz with 40% overlap; macro-F1 peaks at 10 s in parameter sweeps (a segmentation sketch follows this list).
- Modality Selection: Learnable attention masks facilitate optimal modality inclusion per dataset (DEAP: EEG,EOG,SKT; CEAP: EDA,BVP,SKT; MAHNOB: EEG,ECG,EDA).
- Batch Size: 200 windows per SSL minibatch; same or reduced batch sizes for downstream.
- Optimization: Adam optimizer with no learning-rate scheduler (exact learning rate and β settings per the released code).
- Covariance Regularization: Add a small ridge term $\epsilon I$ to the diagonals of all covariance matrices prior to inversion.
- Data Augmentation: None used for MFMC. Baselines employ Gaussian noise, temporal shifts, and channel dropout.
- Hardware: Single NVIDIA L40S (48 GB), 5 hours for full 5-fold DEAP training and evaluation.
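A small NumPy sketch of the windowing step described above; the function name and the 32-channel example are illustrative:

```python
import numpy as np

def segment_windows(signal: np.ndarray, fs: int = 128, win_s: float = 10.0,
                    overlap: float = 0.4) -> np.ndarray:
    """Slice a (channels, samples) recording into 10 s windows at 128 Hz
    with 40% overlap, matching the protocol above."""
    win = int(win_s * fs)                  # 1280 samples per window
    hop = int(win * (1 - overlap))         # 768-sample advance (60% of window)
    starts = range(0, signal.shape[-1] - win + 1, hop)
    return np.stack([signal[..., s:s + win] for s in starts])

# e.g. a 60 s, 32-channel recording -> (9, 32, 1280) windows
x = np.random.randn(32, 60 * 128)
print(segment_windows(x).shape)
```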
Official code, notebooks, and data scripts are public at https://github.com/DY9910/MFMC.
6. Extensions, Generalization, and Context
MFMC is not specific to emotion recognition but constitutes a general SSL framework for multimodal time-series analysis, suitable for tasks such as sleep staging, mental health biomarker inference, and BCI applications. Additional modalities (e.g., camera, audio, eye-tracking, respiratory belt) can be integrated by extending DTC sandwich bounds and encoder/fusion modules.
Post-hoc analysis of MFMC’s learned representations and covariance spectra allows quantification of synergy within modality combinations, providing practical insight into physiological dependence. MFMC features can be fine-tuned in semi-supervised or domain-adaptation settings to handle cross-device or cross-population domain shift.
This approach provides the first augmentation-light SSL methodology for capturing higher-order multimodal interactions via DTC-grounded, trace-based surrogates, yielding robust and high-fidelity representations for a range of temporal, physiological, and distributed sensing applications (Zheng et al., 28 Dec 2025).