Joint Temporal Lipschitz AT (J-TLAT)

Updated 8 February 2026

The paper introduces J-TLAT’s novel defense protocol that jointly targets router and expert vulnerabilities using temporally-adaptive, Lipschitz-guided attacks.
It employs a hierarchical three-stage training cycle integrating component-specific and joint adversarial objectives to boost robustness under white-box attacks.
Empirical results on UCF-101 and HMDB-51 demonstrate significant improvements in robust accuracy and efficiency over traditional MoE adversarial defenses.

Joint Temporal Lipschitz Adversarial Training (J-TLAT) is a defense framework for video Mixture-of-Experts (MoE) models that targets the collaborative vulnerabilities between routers and expert modules, particularly against adversaries exploiting temporal dependencies. J-TLAT systematically combines component-specific and joint adversarial objectives using temporally-adaptive, Lipschitz-guided attacks and a hierarchical adversarial training protocol. This approach enhances model robustness to sophisticated white-box attacks while preserving the efficiency advantages of sparsely-routed MoE architectures (Wang et al., 1 Feb 2026).

1. Vulnerabilities in Video Mixture-of-Experts

Video MoE models use a lightweight router $R(\cdot)$ to select, for each frame or video segment, a small subset of specialized “experts” from a pool, maintaining high capacity at low inference cost. The temporal dimension introduces cumulating effects: minor perturbations in early frames propagate and amplify, exposing two critical attack surfaces—router selection and expert response—as well as their interaction. Standard attacks (PGD, AutoAttack) consider the MoE as a single composite model, maximizing cross-entropy loss $L_\mathrm{CE}(F_\mathrm{MoE}(x+\delta),y)$ under norm constraints, but neglect the component-wise structure. Attackers can destabilize the system by:

Forcing the router towards its lowest-confidence experts, causing a collapse in routing diversity.
Directly perturbing weak experts, exposing their individual vulnerabilities.
Orchestrating mis-routing and expert degradation simultaneously.

Video-specific adversarial methods (e.g., sparse 3D-perturbations, key-frame attacks, time-frequency methods) optimize perturbations with respect to spatio-temporal budgets, but do not directly exploit the MoE structure and fail to induce the collaborative failure modes that arise when the router and its weakest experts are jointly targeted.

2. Temporal Lipschitz-Guided Attacks (TLGA) Formulation

Temporal Lipschitz-Guided Attacks introduce two principal tools:

Lipschitz Ratio Regularization: For any differentiable mapping $g(\cdot)$ (router, expert, or MoE), the local Lipschitz loss is defined as

$L_\mathrm{Lip}[g; x, \delta] = \frac{\|g(x+\delta)-g(x)\|_2^2}{\|\delta\|_2^2}$

Maximizing this loss seeks out directions where small input changes provoke maximal output deviation, pinpointing adversarially sensitive directions.

Temporal Adaptive Step-Size: Per-frame gradient norms are accumulated to inform a momentum vector $V = (V_1,\ldots,V_T)$ , and the update step for each frame is

$V_t \leftarrow \mu V_t + \|\nabla_{x_t} L\|_2, \quad a_t \leftarrow a(1 + \log(1 + V_t))$

Perturbations on more sensitive frames are amplified, enabling more efficient attacks.

The primary TLGA variants and their optimization objectives are outlined in the following table:

Component	Loss Function	Constraint
Router (TLGA-R)	$L_\mathrm{R}(x;\delta_\mathrm{R}) = L_\mathrm{CE}(R(x+\delta_\mathrm{R}),y_\mathrm{R}) + \alpha L_\mathrm{Lip}[R; x, \delta_\mathrm{R}]$	$\\|\delta_\mathrm{R}\\|_p \le \epsilon_\mathrm{R}$
Expert (TLA-E)	$L_\mathrm{E}(x;\delta_{E_i}) = L_\mathrm{CE}(E_i(x+\delta_{E_i}),y) + \alpha L_\mathrm{Lip}[E_i; x, \delta_{E_i}]$	$\\|\delta_{E_i}\\|_p \le \epsilon_\mathrm{E}$
Mixture (TLA-M)	$L_\mathrm{MoE}(x;\delta_\mathrm{M}) = L_\mathrm{CE}(F(x+\delta_\mathrm{M}),y) + \alpha L_\mathrm{Lip}[F; x, \delta_\mathrm{M}]$	$\\|\delta_\mathrm{M}\\|_p \le \epsilon_\mathrm{M}$

This structure enables fine-grained manipulation of the MoE model, beyond conventional attacks, by focusing on and amplifying per-component weaknesses.

3. Joint Temporal Lipschitz-Guided Attacks (J-TLGA)

J-TLGA constructs a unified adversarial perturbation $\delta$ that simultaneously:

Drives the router $R$ to select its least robust experts (mis-routing).
Maximally degrades those selected experts’ performance.
Causes misclassification at the final MoE output.

The joint loss maximized by J-TLGA is:

$L_\mathrm{joint}(x; \delta) = L_\mathrm{CE}(F(x+\delta), y) + \beta L_\mathrm{Lip}[F; x, \delta] + \gamma L_\mathrm{CE}(R(x+\delta), y_\mathrm{R}) + \gamma L_\mathrm{Lip}[R; x, \delta] + \sum_{i \in I} \left( L_\mathrm{CE}(E_i(x+\delta), y) + \alpha L_\mathrm{Lip}[E_i; x, \delta] \right)$

where $I$ indexes the weakest experts as determined by the router's output on the adversarially perturbed input.

A key analytic result shows that the Lipschitz constant of the full MoE model under component-wise coupling can be upper-bounded by

$K_F \leq K_R \cdot \max_i \|E_i\| + \max_i \|R\| \cdot K_{E_i}$

This demonstrates that simultaneous (joint) maximization disrupts the global model more effectively than separated per-component attacks. Empirically, J-TLGA induces router “collapse”—the repeated selection of a single weak expert under attack—resulting in sharply reduced robust accuracy relative to baseline or component-wise attacks.

4. Joint Temporal Lipschitz Adversarial Training (J-TLAT) Protocol

J-TLAT is a min–max adversarial training framework specifically crafted for video MoE. The outer optimization seeks parameters $\theta$ that minimize the worst-case adversarial loss under budgeted perturbations:

$\theta^* = \arg\min_\theta \max_{\|\delta\|_p \leq \epsilon} \left[ L_\mathrm{CE}(F_\theta(x+\delta), y) + \lambda R_\mathrm{temporal}(x, \delta) \right]$

where $R_\mathrm{temporal}(x, \delta) = L_\mathrm{Lip}[F; x, \delta]$ regularizes the global Lipschitz property.

J-TLAT employs a hierarchical, three-stage inner adversarial training cycle for each batch:

Router AT: Generate $\delta_\mathrm{R}$ using TLGA-R, update router parameters $\theta_R$ by optimizing the router’s cross-entropy loss and Lipschitz penalty.
Expert AT: With $\delta_\mathrm{R}$ applied, identify the weakest experts $I$ ; for each $i \in I$ , craft $\delta_{E_i}$ with TLA-E and update $\theta_{E_i}$ accordingly.
Joint MoE AT: Apply J-TLGA to the entire model for a final update.

The following sketch outlines the per-epoch training cycle:

for batch (x, y) in dataloader:
    # Step 1: Router AT
    x_R = TLGA_R(x; θ_R)
    loss_R = LCE(R_θ(x_R), R_θ(x)) + λ_R * LLip[R; x, x_R - x]
    θ_R -= η * ∇_{θ_R} loss_R

    # Step 2: Expert AT
    I = top_k_weak_experts(R_θ(x_R))
    for i in I:
        x_E = TLA_E(x; θ_{E_i})
        loss_Ei = LCE(E_{θ_i}(x_E), y) + λ_E * LLip[E_i; x, x_E - x]
        θ_{E_i} -= η * ∇_{θ_{E_i}} loss_Ei

    # Step 3: Full MoE AT
    x_M = J_TLGA(x; θ)
    loss_M = LCE(F_θ(x_M), y) + λ * LLip[F; x, x_M - x]
    θ -= η * ∇_θ loss_M

\lambda_R

\lambda_E

, and

\lambda

denote Lipschitz weights per component;

\epsilon

’s are typically $8/255$–$14/255$ (L

\infty

) for attack budgets.

5. Empirical Evaluation and Quantitative Insights

J-TLAT has been evaluated on UCF-101 and HMDB-51 video classification benchmarks using four expert architectures: 3D-ResNet-18, TSM, SlowFast, and R(2+1)D. A 4-expert pool is managed by a lightweight Top-1 router. Comparative baselines include dense adversarial training (AT-D), sparse expert AT (AT-S), baseline MoE AT (AT-M), OUDefend (OUD-M), and AAT (AAT-M).

Metrics encompass:

Clean Accuracy (%): On unperturbed videos.
Robust Accuracy (%): Under attacks (FGSM, PGD, AutoAttack, TT, TLA-M, TLA-E, TLGA-R, J-TLGA), with $\epsilon$ in $\{8/255,10/255,12/255,14/255\}$ .
Inference Cost (GFLOPs): For efficiency.
Empirical Lipschitz Constants: For both router (Lips-R) and joint model (Lips-J).
IoU for Routing Consistency: To assess route collapse under attack.

Key findings:

J-TLAT sustains performance: Under the strongest J-TLGA attack ( $\epsilon=14/255$ ), AT-M robustness collapses to $\approx 15.8\%$ , while J-TLAT maintains $\approx 26.9\%$ , a relative gain of $\sim34\%$ .
Cost efficiency: J-TLAT’s inference cost (4-expert MoE) is $\approx1.83$ GFLOPs, compared to $4.79$ GFLOPs for dense ResNet-18, achieving over $60\%$ reduction.
Lipschitz smoothing: Empirical Lipschitz constants decrease dramatically (from $\sim953$ for AAT-M to $0.823$ for J-TLAT under TLGA-R), indicating successful smoothing of the decision boundary.
Ablation studies: Temporal adaptation plus Lipschitz regularization (TLA) surpasses PGD+Lipschitz, PGD, and FGSM in inner training.

6. Implementation and Practical Guidance

Hyperparameters and Setup

Perturbation budgets $\epsilon$ : $8/255$–$14/255$ (L $\infty$ ).
Lipschitz regularization weights: $\lambda_R$ , $\lambda_E$ , $\lambda$ set to $1.0$.
Base attack rate $a$ : $2/255$; momentum $\mu=0.5$ for temporal adaptation.
Learning rate $\eta$ : $5\times10^{-4}$ with cosine annealing over 70 epochs.

Compatibility and Overhead

J-TLAT is plug-and-play: substituting standard adversarial training with the described three-stage cycle integrates seamlessly with any MoE pipeline. Relative to standard training, overall training time approximately doubles due to triple adversarial example generation per batch, but inference time remains nearly unaffected, preserving the computational savings of sparse MoE ( $\approx1.8$ GFLOPs). Memory overhead is modest owing to input buffer and gradient sharing between attacks.

7. Significance and Implications

J-TLAT directly addresses the newly exposed collaborative weaknesses of video MoE models arising from the temporal dimension and the interplay between router and experts. By instantiating component-wise adversarial training both at the router and at the expert level, and combining these with joint global defense, J-TLAT robustly mitigates the Achilles’ Heel of sparse MoE architectures for video. It maintains both superior adversarial robustness and the efficiency advantages of sparse activation, positioning it as an effective and practical defense for real-world video understanding systems (Wang et al., 1 Feb 2026). A plausible implication is that future adversarial research and defenses for temporally structured, modular architectures will need to harmonize component-specific and joint robustness measures in a similar hierarchical fashion.

Markdown Report Issue Upgrade to Chat

References (1)

Exposing and Defending the Achilles' Heel of Video Mixture-of-Experts (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Joint Temporal Lipschitz Adversarial Training (J-TLAT).