
Joint Temporal Lipschitz AT (J-TLAT)

Updated 8 February 2026
  • The paper introduces J-TLAT, a defense protocol that jointly targets router and expert vulnerabilities using temporally adaptive, Lipschitz-guided attacks.
  • It employs a hierarchical three-stage training cycle integrating component-specific and joint adversarial objectives to boost robustness under white-box attacks.
  • Empirical results on UCF-101 and HMDB-51 demonstrate significant improvements in robust accuracy and efficiency over traditional MoE adversarial defenses.

Joint Temporal Lipschitz Adversarial Training (J-TLAT) is a defense framework for video Mixture-of-Experts (MoE) models that targets the collaborative vulnerabilities between routers and expert modules, particularly against adversaries exploiting temporal dependencies. J-TLAT systematically combines component-specific and joint adversarial objectives using temporally-adaptive, Lipschitz-guided attacks and a hierarchical adversarial training protocol. This approach enhances model robustness to sophisticated white-box attacks while preserving the efficiency advantages of sparsely-routed MoE architectures (Wang et al., 1 Feb 2026).

1. Vulnerabilities in Video Mixture-of-Experts

Video MoE models use a lightweight router $R(\cdot)$ to select, for each frame or video segment, a small subset of specialized “experts” from a pool, maintaining high capacity at low inference cost. The temporal dimension introduces cumulative effects: minor perturbations in early frames propagate and amplify, exposing two critical attack surfaces—router selection and expert response—as well as their interaction. Standard attacks (PGD, AutoAttack) treat the MoE as a single composite model, maximizing the cross-entropy loss $L_\mathrm{CE}(F_\mathrm{MoE}(x+\delta),y)$ under norm constraints, but neglect the component-wise structure. Attackers can destabilize the system by:

  • Forcing the router towards its lowest-confidence experts, causing a collapse in routing diversity.
  • Directly perturbing weak experts, exposing their individual vulnerabilities.
  • Orchestrating mis-routing and expert degradation simultaneously.

Video-specific adversarial methods (e.g., sparse 3D-perturbations, key-frame attacks, time-frequency methods) optimize perturbations with respect to spatio-temporal budgets, but do not directly exploit the MoE structure and fail to induce the collaborative failure modes that arise when the router and its weakest experts are jointly targeted.
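The composite-model baseline described above can be sketched for a toy linear softmax classifier; the model, shapes, and step sizes here are illustrative, not from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pgd_composite(x, y, W, eps=8/255, alpha=2/255, steps=10):
    """L-inf PGD that treats the model as one composite map: maximize
    cross-entropy of softmax(W(x + delta)) against label y under
    ||delta||_inf <= eps.  Any internal router/expert structure is
    ignored, as in the baseline attacks discussed above."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        p = softmax(W @ (x + delta))
        p[y] -= 1.0                 # dCE/dlogits = softmax(z) - onehot(y)
        grad = W.T @ p              # chain rule back to the input
        delta = np.clip(delta + alpha * np.sign(grad), -eps, eps)
    return x + delta
```

For a real video MoE the gradient would come from autodiff through the full network; the structure of the loop (ascend on the sign of the input gradient, project to the budget) is the same.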

2. Temporal Lipschitz-Guided Attacks (TLGA) Formulation

Temporal Lipschitz-Guided Attacks introduce two principal tools:

  • Lipschitz Ratio Regularization: For any differentiable mapping $g(\cdot)$ (router, expert, or MoE), the local Lipschitz loss is defined as

$$L_\mathrm{Lip}[g; x, \delta] = \frac{\|g(x+\delta)-g(x)\|_2^2}{\|\delta\|_2^2}$$

Maximizing this loss seeks out directions where small input changes provoke maximal output deviation, pinpointing adversarially sensitive directions.

  • Temporal Adaptive Step-Size: Per-frame gradient norms are accumulated into a momentum vector $V = (V_1,\ldots,V_T)$, and the update step for each frame is

$$V_t \leftarrow \mu V_t + \|\nabla_{x_t} L\|_2, \quad a_t \leftarrow a\,(1 + \log(1 + V_t))$$

Perturbations on more sensitive frames are amplified, enabling more efficient attacks.
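Both tools are simple to express numerically. A minimal NumPy sketch (function and variable names are illustrative, not taken from the paper's code):

```python
import numpy as np

def lipschitz_ratio(g, x, delta):
    """Local Lipschitz loss L_Lip[g; x, delta]: squared output deviation
    of the mapping g over the squared perturbation norm."""
    num = np.sum((g(x + delta) - g(x)) ** 2)
    den = np.sum(delta ** 2)
    return num / den

def temporal_step_sizes(grad_norms, V, a=2/255, mu=0.5):
    """Momentum-accumulated per-frame sensitivity V_t and the adaptive
    step sizes a_t = a * (1 + log(1 + V_t)).

    grad_norms -- per-frame gradient norms ||grad_{x_t} L||_2, shape (T,)
    V          -- running momentum vector, shape (T,)
    """
    V = mu * V + grad_norms
    return V, a * (1.0 + np.log1p(V))
```

For a linear map $g(x)=Wx$ the ratio reduces to $\|W\delta\|_2^2/\|\delta\|_2^2$, so maximizing it drives $\delta$ toward the map's most expansive direction; likewise, frames with persistently large gradients receive logarithmically larger attack steps.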

The primary TLGA variants and their optimization objectives are outlined in the following table:

| Component | Loss Function | Constraint |
| --- | --- | --- |
| Router (TLGA-R) | $L_\mathrm{R}(x;\delta_\mathrm{R}) = L_\mathrm{CE}(R(x+\delta_\mathrm{R}),y_\mathrm{R}) + \alpha L_\mathrm{Lip}[R; x, \delta_\mathrm{R}]$ | $\|\delta_\mathrm{R}\|_p \le \epsilon_\mathrm{R}$ |
| Expert (TLA-E) | $L_\mathrm{E}(x;\delta_{E_i}) = L_\mathrm{CE}(E_i(x+\delta_{E_i}),y) + \alpha L_\mathrm{Lip}[E_i; x, \delta_{E_i}]$ | $\|\delta_{E_i}\|_p \le \epsilon_\mathrm{E}$ |
| Mixture (TLA-M) | $L_\mathrm{MoE}(x;\delta_\mathrm{M}) = L_\mathrm{CE}(F(x+\delta_\mathrm{M}),y) + \alpha L_\mathrm{Lip}[F; x, \delta_\mathrm{M}]$ | $\|\delta_\mathrm{M}\|_p \le \epsilon_\mathrm{M}$ |

This structure enables fine-grained manipulation of the MoE model, beyond conventional attacks, by focusing on and amplifying per-component weaknesses.
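All three rows share one shape: cross-entropy on the perturbed input plus a weighted local Lipschitz term for the targeted component. A minimal sketch, assuming a component `g` maps an input to class logits (names are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def component_attack_loss(g, x, delta, y, alpha=1.0):
    """Per-component TLGA objective: L_CE(g(x + delta), y)
    plus alpha times the local Lipschitz ratio of g at (x, delta).
    g may stand for the router R, an expert E_i, or the full MoE F."""
    ce = -np.log(softmax(g(x + delta))[y])
    lip = np.sum((g(x + delta) - g(x)) ** 2) / np.sum(delta ** 2)
    return ce + alpha * lip
```

The attacker ascends this loss in `delta` (subject to the norm budget in the table); swapping in a different `g` and target label switches between the TLGA-R, TLA-E, and TLA-M variants.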

3. Joint Temporal Lipschitz-Guided Attacks (J-TLGA)

J-TLGA constructs a unified adversarial perturbation $\delta$ that simultaneously:

  • Drives the router $R$ to select its least robust experts (mis-routing).
  • Maximally degrades those selected experts’ performance.
  • Causes misclassification at the final MoE output.

The joint loss maximized by J-TLGA is:

$$L_\mathrm{joint}(x; \delta) = L_\mathrm{CE}(F(x+\delta), y) + \beta L_\mathrm{Lip}[F; x, \delta] + \gamma L_\mathrm{CE}(R(x+\delta), y_\mathrm{R}) + \gamma L_\mathrm{Lip}[R; x, \delta] + \sum_{i \in I} \left( L_\mathrm{CE}(E_i(x+\delta), y) + \alpha L_\mathrm{Lip}[E_i; x, \delta] \right)$$

where $I$ indexes the weakest experts as determined by the router's output on the adversarially perturbed input.

A key analytic result shows that the Lipschitz constant of the full MoE model under component-wise coupling can be upper-bounded by

$$K_F \leq K_R \cdot \max_i \|E_i\| + \max_i \|R\| \cdot K_{E_i}$$

This demonstrates that simultaneous (joint) maximization disrupts the global model more effectively than separated per-component attacks. Empirically, J-TLGA induces router “collapse”—the repeated selection of a single weak expert under attack—resulting in sharply reduced robust accuracy relative to baseline or component-wise attacks.
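The bound can be motivated by a short derivation sketch, assuming the MoE output takes the standard router-weighted form $F(x) = \sum_i R_i(x)\,E_i(x)$ (the paper's exact coupling assumptions may differ). Adding and subtracting $R_i(x')E_i(x)$ inside each term and applying the triangle inequality gives:

```latex
\|F(x)-F(x')\|
  = \Big\| \sum_i \big( R_i(x)E_i(x) - R_i(x')E_i(x') \big) \Big\|
  \le \sum_i \|R_i(x)-R_i(x')\| \,\|E_i(x)\|
    + \sum_i |R_i(x')| \,\|E_i(x)-E_i(x')\|
  \le \big( K_R \max_i \|E_i\| + \|R\| \max_i K_{E_i} \big)\, \|x-x'\|
```

The first summand is the router's Lipschitz constant scaled by the largest expert output, the second the router's output scale times the worst expert Lipschitz constant, which is why an attack that inflates both router and expert sensitivities simultaneously loosens the global bound faster than attacking either alone.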

4. Joint Temporal Lipschitz Adversarial Training (J-TLAT) Protocol

J-TLAT is a min–max adversarial training framework specifically crafted for video MoE. The outer optimization seeks parameters $\theta$ that minimize the worst-case adversarial loss under budgeted perturbations:

$$\theta^* = \arg\min_\theta \max_{\|\delta\|_p \leq \epsilon} \left[ L_\mathrm{CE}(F_\theta(x+\delta), y) + \lambda R_\mathrm{temporal}(x, \delta) \right]$$

where $R_\mathrm{temporal}(x, \delta) = L_\mathrm{Lip}[F; x, \delta]$ regularizes the global Lipschitz property.

J-TLAT employs a hierarchical, three-stage inner adversarial training cycle for each batch:

  1. Router AT: Generate $\delta_\mathrm{R}$ using TLGA-R, then update the router parameters $\theta_R$ by optimizing the router’s cross-entropy loss and Lipschitz penalty.
  2. Expert AT: With $\delta_\mathrm{R}$ applied, identify the weakest experts $I$; for each $i \in I$, craft $\delta_{E_i}$ with TLA-E and update $\theta_{E_i}$ accordingly.
  3. Joint MoE AT: Apply J-TLGA to the entire model for a final update.

The following sketch outlines the per-epoch training cycle:

for batch (x, y) in dataloader:
    # Step 1: Router AT
    x_R = TLGA_R(x; θ_R)                      # router-targeted adversarial example
    loss_R = LCE(R_θ(x_R), R_θ(x)) + λ_R * LLip[R; x, x_R - x]
    θ_R -= η * ∇_{θ_R} loss_R

    # Step 2: Expert AT
    I = top_k_weak_experts(R_θ(x_R))
    for i in I:
        x_E = TLA_E(x; θ_{E_i})               # expert-targeted adversarial example
        loss_Ei = LCE(E_{θ_i}(x_E), y) + λ_E * LLip[E_i; x, x_E - x]
        θ_{E_i} -= η * ∇_{θ_{E_i}} loss_Ei

    # Step 3: Full MoE AT
    x_M = J_TLGA(x; θ)                        # joint attack on the whole model
    loss_M = LCE(F_θ(x_M), y) + λ * LLip[F; x, x_M - x]
    θ -= η * ∇_θ loss_M
Here $\lambda_R$, $\lambda_E$, and $\lambda$ denote the per-component Lipschitz weights; attack budgets $\epsilon$ are typically $8/255$–$14/255$ ($L_\infty$).

5. Empirical Evaluation and Quantitative Insights

J-TLAT has been evaluated on UCF-101 and HMDB-51 video classification benchmarks using four expert architectures: 3D-ResNet-18, TSM, SlowFast, and R(2+1)D. A 4-expert pool is managed by a lightweight Top-1 router. Comparative baselines include dense adversarial training (AT-D), sparse expert AT (AT-S), baseline MoE AT (AT-M), OUDefend (OUD-M), and AAT (AAT-M).

Metrics encompass:

  • Clean Accuracy (%): On unperturbed videos.
  • Robust Accuracy (%): Under attacks (FGSM, PGD, AutoAttack, TT, TLA-M, TLA-E, TLGA-R, J-TLGA), with $\epsilon \in \{8/255, 10/255, 12/255, 14/255\}$.
  • Inference Cost (GFLOPs): For efficiency.
  • Empirical Lipschitz Constants: For both router (Lips-R) and joint model (Lips-J).
  • IoU for Routing Consistency: To assess route collapse under attack.
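The routing-consistency metric can be read as a set overlap between clean and attacked expert selections. A minimal sketch of one plausible per-frame IoU (the paper's exact definition may differ in averaging details):

```python
import numpy as np

def routing_iou(clean_experts, adv_experts):
    """Mean per-frame IoU between the expert sets selected on clean and
    adversarial inputs.  Each element is the list of expert indices chosen
    for one frame; a low score signals route collapse under attack."""
    ious = []
    for c, a in zip(clean_experts, adv_experts):
        c, a = set(c), set(a)
        ious.append(len(c & a) / len(c | a))
    return float(np.mean(ious))
```

For Top-1 routing the per-frame IoU is simply 1 when the selection is preserved and 0 when the attack flips it, so a collapse onto a single weak expert drags the score toward zero.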

Key findings:

  • J-TLAT sustains performance: Under the strongest J-TLGA attack ($\epsilon = 14/255$), AT-M robust accuracy collapses to $\approx 15.8\%$, while J-TLAT maintains $\approx 26.9\%$, a relative gain of $\sim 34\%$.
  • Cost efficiency: J-TLAT’s inference cost (4-expert MoE) is $\approx 1.83$ GFLOPs, compared to $4.79$ GFLOPs for dense ResNet-18, a reduction of over $60\%$.
  • Lipschitz smoothing: Empirical Lipschitz constants decrease dramatically (from $\sim 953$ for AAT-M to $0.823$ for J-TLAT under TLGA-R), indicating successful smoothing of the decision boundary.
  • Ablation studies: Temporal adaptation plus Lipschitz regularization (TLA) surpasses PGD+Lipschitz, PGD, and FGSM in inner training.

6. Implementation and Practical Guidance

Hyperparameters and Setup

  • Perturbation budgets $\epsilon$: $8/255$–$14/255$ ($L_\infty$).
  • Lipschitz regularization weights: $\lambda_R$, $\lambda_E$, $\lambda$ set to $1.0$.
  • Base attack rate $a$: $2/255$; momentum $\mu = 0.5$ for temporal adaptation.
  • Learning rate $\eta$: $5\times10^{-4}$ with cosine annealing over 70 epochs.
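The reported setup collects into a small configuration object; the field names below are illustrative, not the authors' identifiers:

```python
from dataclasses import dataclass

@dataclass
class JTLATConfig:
    """Hyperparameters as reported for J-TLAT (names are hypothetical)."""
    eps: float = 8 / 255        # L-inf perturbation budget (up to 14/255)
    lambda_router: float = 1.0  # Lipschitz weight for router AT
    lambda_expert: float = 1.0  # Lipschitz weight for expert AT
    lambda_joint: float = 1.0   # Lipschitz weight for joint MoE AT
    base_step: float = 2 / 255  # base attack rate a
    momentum: float = 0.5       # temporal momentum mu
    lr: float = 5e-4            # cosine-annealed learning rate eta
    epochs: int = 70
```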

Compatibility and Overhead

J-TLAT is plug-and-play: substituting standard adversarial training with the described three-stage cycle integrates seamlessly with any MoE pipeline. Relative to standard training, overall training time approximately doubles because three rounds of adversarial examples are generated per batch, but inference time is nearly unaffected, preserving the computational savings of sparse MoE ($\approx 1.8$ GFLOPs). Memory overhead is modest owing to input-buffer and gradient sharing between the attacks.

7. Significance and Implications

J-TLAT directly addresses the newly exposed collaborative weaknesses of video MoE models arising from the temporal dimension and the interplay between router and experts. By instantiating component-wise adversarial training both at the router and at the expert level, and combining these with joint global defense, J-TLAT robustly mitigates the Achilles’ Heel of sparse MoE architectures for video. It maintains both superior adversarial robustness and the efficiency advantages of sparse activation, positioning it as an effective and practical defense for real-world video understanding systems (Wang et al., 1 Feb 2026). A plausible implication is that future adversarial research and defenses for temporally structured, modular architectures will need to harmonize component-specific and joint robustness measures in a similar hierarchical fashion.
