VARMAT: Vulnerability-Aware Robust Multimodal Training

Updated 29 November 2025

VARMAT quantifies modality-specific vulnerabilities using gradient norms and temperature-scaled weights to direct adversarial perturbations.
It implements targeted regularization and efficient perturbation schemes to balance robust training with computational efficiency.
The framework employs specialized fusion strategies, including expert gating and detection modules, to maintain resilience across heterogeneous modalities.

Vulnerability-Aware Robust Multimodal Adversarial Training (VARMAT) is an adversarial training framework designed to enhance the robustness of multimodal neural networks against adversarial attacks by explicitly modeling and addressing modality-specific vulnerabilities. VARMAT’s core principle is to quantify, monitor, and penalize the sensitivity (vulnerability) of each input modality, thereby achieving a more balanced and resilient fusion of heterogeneous modalities under adversarial threat. The framework has been proposed and developed in several variations, each contributing novel mechanisms for vulnerability estimation, targeted regularization, and efficient adversarial example generation across modalities (Zhang et al., 22 Nov 2025, zhang et al., 17 Feb 2025, Yang et al., 2022).

1. Theoretical Foundations and Problem Definition

Multimodal models, integrating $M$ heterogeneous modalities $\mathrm{x} = \{x_1, x_2, \dots, x_M\}$ , are prone to adversarial weaknesses arising from unbalanced information contribution and inter-modal dependencies. Traditionally, adversarial training applies a min–max risk: $\min_{\theta} \mathbb{E}_{(\mathrm{x}, y)}\left[ \max_{\|\delta\|_{p}\leq\epsilon} \mathcal{L}(f_\theta(\mathrm{x}+\delta), y) \right]$ where each $\delta_m$ is a perturbation to $x_m$ within $\ell_p$ -budget $\epsilon_m$ (Zhang et al., 22 Nov 2025).

However, different modalities exhibit distinct vulnerabilities, formalized through modality-specific vulnerability indicators. For feature-space attacks, a first-order expansion yields a per-modality vulnerability score: $V_m := \|\nabla_{x_{m}}\mathcal{L}(f_\theta(\mathrm{x}),y)\|_F\,\|x_m\|_F$ The normalized vulnerability weights,

$w_m = \frac{\exp(V_m/T)}{\sum_{k=1}^{M} \exp(V_k/T)},$

capture the relative risk associated with each modality.

Key theoretical work has shown that multimodal certified robustness is governed by the uni-modal representation margin and the reliability of integration, making models disproportionately vulnerable to attacks on highly weighted ("preferred") modalities (Yang et al., 9 Feb 2024). This motivates explicit monitoring and regularization based on vulnerability estimation.

2. Vulnerability Quantification and Probing

VARMAT introduces an in-training "probe" step to quantify the gradient-based vulnerability of each modality before adversarial example generation. For each modality, its vulnerability is approximated by the product $\|g_m\|_F\,\|x_m\|_F$ , with $g_m := \nabla_{x_{m}}\mathcal{L}(f_\theta(\mathrm{x}), y)$ . This quantification directly informs both attack generation and regularization:

Single-step adversarial direction: Perturbation $\delta_m = \frac{g_m}{\|g_m\|_F} \epsilon_m$ aligns with maximal loss increase.
Vulnerability-driven weights: Modality weights $w_m$ influence the regularization and, when used within perturbation crafting, tailor adversarial strength to the most sensitive modalities (Zhang et al., 22 Nov 2025).

3. Targeted Regularization and Loss Functions

VARMAT employs a regularized adversarial training loss tailored to modality-specific vulnerabilities: $\mathcal{L}_{\text{total}} = \mathcal{L}(f_\theta(\mathrm{x}+\delta^*), y) + \beta \sum_{m=1}^{M} \|\nabla_{x_{m}}\mathcal{L}(f_\theta(\mathrm{x}), y)\|_F$ where $\delta^*$ denotes the adversarial perturbation (feature or input space) constructed via gradient-based methods (e.g., PGD, FGSM-RS). The regularizer penalizes high vulnerability, guiding the model to balance robustness across modalities and targeting the most susceptible channels (Zhang et al., 22 Nov 2025).

Tuning the regularization coefficient $\beta$ is dataset-dependent. For example, $\beta = 1000$ on CMU-MOSEI and $\beta = 50$ on UR-FUNNY and AVMNIST, determined empirically by the scale of vulnerability indicators (Zhang et al., 22 Nov 2025).

4. Fusion Strategies and Detection Mechanisms

Extended VARMAT variants use architectural mechanisms to counteract single-source attacks:

Inconsistency-Detection Module: An "odd-one-out" network $o(z^1, ..., z^M) \in \Delta^{M+1}$ predicts which modality is inconsistent. The detection head is trained using $(M+1)$ -way cross-entropy over clean and perturbed features.
Expert Gating: $M+1$ fusion experts $e_j$ construct fused features excluding each modality in turn. Fusion is a weighted mixture using $o(\cdot)$ 's softmax probabilities to filter out suspected corrupted sources: $z_{\text{fused}} = \sum_{j=1}^{M+1} e_j(z^1, \dots, z^M) \cdot o(z^1, \dots, z^M)_j$
Joint Training: The robust model is trained with a loss combining clean accuracy, adversarial robustness (perturbing each modality separately), and detection accuracy (Yang et al., 2022).

This strategy specifically targets single-modality attacks and relies on feature-level gating to prevent contaminated information from dominating fusion.

5. Efficient Adversarial Example Generation

Standard full-modal PGD is computationally expensive for long sequences and high-dimensional modalities. Recent VARMAT implementations introduce efficient perturbation schemes:

Segmented Temporal Crafting (for audio-visual streams): Clips are partitioned into segments, and only a random $\rho_x$ -fraction of frames per segment are used to craft universal perturbations (broadcast to the segment). This reduces gradient computations while maintaining attack effectiveness.
Vulnerability-Aware Losses: Attack objectives specifically target temporal-invariant features and modality misalignment via variance and cosine similarity losses in the inner maximization (zhang et al., 17 Feb 2025).
Adversarial Curriculum: Sampling ratios $\rho_x$ and dropout rates $\rho_f$ are smoothly scheduled (e.g., cosine annealing) to avoid overfitting to a single perturbation regime.

These methods yield significant speed-ups (up to $3.7\times$ faster than vanilla AT) and improved robustness (zhang et al., 17 Feb 2025).

6. Empirical Results and Comparative Evaluation

VARMAT achieves consistently superior adversarial robustness across multiple datasets and benchmarks. Key findings include:

Dataset	Clean Acc. (base/VARMAT)	Adversarial Robustness (base/VARMAT)	Robustness Gain
CMU-MOSEI	78.83 / 79.15	55.76 / 68.49 (PGD, $\lambda=0.5$ )	+12.73%
UR-FUNNY	68.10 / 66.76	35.82 / 58.03 (V-PGD, $\lambda=0.5$ )	+22.21%
AVMNIST	67.72 / 62.16	1.55 / 12.74 (V-FGSM, $\lambda=0.5$ )	+11.19%

On Kinetics-Sounds with white-box TMA attack, VARMAT (frame-sampling $\rho_x=15\%$ ) reaches 81.5% defense accuracy, outperforming CRMT-AT (79.2%) and vanilla AT (76.8%), with significantly reduced training time (zhang et al., 17 Feb 2025).

Compared to other strategies, such as Certifiable Robust Multi-modal Training (CRMT), which explicitly balances uni-modal margins and integration factors (Yang et al., 9 Feb 2024), VARMAT provides both practical efficiency and enhanced robustness through adaptive probing and targeted regularization.

7. Limitations and Future Directions

Current VARMAT frameworks primarily address single-source or low simultaneous modality attacks. The expert gating approach scales combinatorially with the number of simultaneously attacked modalities, motivating research into hierarchical or randomized aggregation mechanisms for broader threat models (Yang et al., 2022). VARMAT’s hyperparameters, notably $\beta$ and softmax temperature, require dataset-specific tuning.

Extensions include:

Automated tuning of regularization parameters.
Multi-step adversarial inner maximization (PGD-based training loops).
Application to generative or diffusion-based multimodal architectures.
Incorporation of certifiable bounds and explicit regularization of inter-modality integration, as pioneered by CRMT (Yang et al., 9 Feb 2024).

Empirical and theoretical evidence supports the central thesis of VARMAT: that explicit vulnerability quantification and regularization drive substantial improvements in multimodal adversarial robustness without compromising clean-data accuracy (Zhang et al., 22 Nov 2025, zhang et al., 17 Feb 2025, Yang et al., 2022).