Alternating Training for Multi-Module (ATMM)
- ATMM is a modular framework that alternates parameter updates across network modules to reduce gradient conflicts and enhance convergence.
- It encompasses strategies such as epoch-based alternation, blockwise updates with adaptive step sizes, and alternating supervised/unsupervised methods.
- ATMM improves generalization and robustness across applications like multi-task learning, cross-modal fusion, and noisy-label correction.
Alternating Training for Multi-Module (ATMM) is a family of optimization strategies broadly characterized by cyclically, randomly, or deterministically alternating the update of parameters between different modules (blocks, tasks, or subnetworks) within a neural architecture. Unlike conventional joint- or multi-task training, where all modules’ parameters are optimized simultaneously, ATMM seeks to reduce gradient interference, stabilize convergence, improve regularization, and enable efficient multi-objective learning by enforcing explicit alternation of parameter updates. This paradigm has been implemented for a variety of scenarios, including multi-task branches in shared neural networks, cross-modal fusion systems, deep clustering/classification frameworks, and layer-wise or blockwise parameter partitioning.
1. General Principles and Algorithmic Frameworks
ATMM encompasses methods wherein modules within a neural network—such as task-specific heads, distinct pre-trained backbones, clustering/classification pairs, or architectural blocks—are updated one at a time or in alternation rather than jointly. In multi-task or multi-modal architectures, this usually corresponds to updating only the parameters belonging to one module, while other modules (and sometimes shared fusion layers) are held fixed or partially unfrozen.
For hard-parameter-sharing multi-task neural networks (MTNNs), as studied in “ATE-SG: Alternate Through the Epochs Stochastic Gradient for Multi-Task Neural Networks” (Bellavia et al., 2023), the alternate training procedure cycles between epochs in which only the shared parameters are updated and epochs in which only the task-specific (branch) parameters are updated. The overall loss is typically a weighted sum across tasks or modules.
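This epoch-wise alternation can be sketched on a toy problem. The following is a minimal illustration, not the paper's implementation: a linear "trunk" shared by two linear task heads, trained by plain gradient descent where even-numbered epochs update only the shared parameters and odd-numbered epochs update only the branch parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hard-parameter-sharing MTNN: shared linear trunk, two linear heads,
# synthetic regression targets for the two tasks.
X = rng.normal(size=(64, 5))
Y1 = X @ rng.normal(size=(5,))            # target for task 1
Y2 = X @ rng.normal(size=(5,))            # target for task 2

W_shared = rng.normal(size=(5, 3)) * 0.1  # trunk parameters
w1 = rng.normal(size=(3,)) * 0.1          # branch (head) for task 1
w2 = rng.normal(size=(3,)) * 0.1          # branch (head) for task 2
lr = 0.05

def losses():
    H = X @ W_shared
    r1, r2 = H @ w1 - Y1, H @ w2 - Y2
    return np.mean(r1**2), np.mean(r2**2)

init_total = sum(losses())

for epoch in range(200):
    H = X @ W_shared
    r1, r2 = H @ w1 - Y1, H @ w2 - Y2
    if epoch % 2 == 0:
        # Shared-parameter epoch: update only the trunk.
        grad_W = 2 * X.T @ (np.outer(r1, w1) + np.outer(r2, w2)) / len(X)
        W_shared -= lr * grad_W
    else:
        # Task-specific epoch: update only the branches.
        w1 -= lr * 2 * H.T @ r1 / len(X)
        w2 -= lr * 2 * H.T @ r2 / len(X)

total = sum(losses())
```

The alternation here is the two-way branch on `epoch % 2`; in a framework setting the same effect is achieved by toggling which parameter groups are trainable.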
In cross-modal architectures for automatic speaker verification (SASV), as in “ATMM-SAGA: Alternating Training for Multi-Module with Score-Aware Gated Attention SASV system” (Asali et al., 23 May 2025) and “ELEAT-SAGA” (Asali et al., 14 Feb 2026), ATMM iteratively alternates between updates focused on the anti-spoofing countermeasure (CM) branch and speaker verification (ASV) branch, with only the relevant weights unfrozen in each step, and specific loss weighting.
Blockwise variations, such as “SAMT: Neural Network Training via Stochastic Alternating Minimization with Trainable Step Sizes” (Yan et al., 6 Aug 2025), treat each layer or block as a module and cyclically update one at a time with potentially adaptive, learnable step sizes.
In robust deep clustering for cyberattack resilience, as in “A Multi-module Robust Method for Transient Stability Assessment against False Label Injection Cyberattacks” (Wang et al., 2024), ATMM alternates between optimizing a supervised classification module and an unsupervised clustering module, coordinating label correction between the two.
2. Representative Algorithmic Variants
Several concrete instantiations of ATMM are found in recent literature:
A. Alternate Through the Epochs (ATE-SG)
- Multi-head MTNN whose parameter vector is partitioned into a shared (trunk) component and task-specific (branch) components.
- Training cycles alternate epochs of updating only the trunk parameters, then epochs of updating only the branch parameters, per (Bellavia et al., 2023).
- Within each phase, standard mini-batch SGD is applied, and the switch is implemented by toggling `requires_grad` flags.
B. Blockwise Alternating with Trainable Step Sizes (SAMT)
- Treat layers or blocks as separate modules; in each iteration, update a single block.
- Each update employs a block-specific, meta-learned or trainable step size, which may be scalar, element-wise, row-wise, or column-wise.
- The step size is updated via a convex combination leveraging statistics of the local block gradient, as in (Yan et al., 6 Aug 2025).
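A simplified sketch of the blockwise scheme follows. The per-block step-size rule here is an illustrative stand-in for SAMT's meta-learned update: a convex combination of the previous step size and a curvature statistic of the current block gradient, not the paper's MLP-based mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)

# Blockwise alternating minimization on a quadratic f(x) = 0.5 x^T A x,
# with one adaptive step size per block.
A = np.diag([1.0, 1.0, 10.0, 10.0])   # two blocks with different curvature
x = rng.normal(size=4)
blocks = [slice(0, 2), slice(2, 4)]
eta = [0.1, 0.1]                      # trainable per-block step sizes
beta = 0.9                            # mixing weight of the convex combination

def f(x):
    return 0.5 * x @ A @ x

for t in range(300):
    j = t % len(blocks)               # cycle through blocks
    g = (A @ x)[blocks[j]]            # gradient w.r.t. block j only
    # Convex combination: retain memory of eta, pull toward the exact
    # steepest-descent step for this block's local curvature.
    curv = (g @ g) / max(g @ (A[blocks[j], blocks[j]] @ g), 1e-12)
    eta[j] = beta * eta[j] + (1 - beta) * curv
    x[blocks[j]] -= eta[j] * g

final = f(x)
```

Because each block adapts its own step size, the stiff block (curvature 10) settles near a small step while the flat block (curvature 1) grows toward a large one, which is the behavior blockwise trainable step sizes are meant to provide.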
C. Multi-Branch Alternation in Multi-Modal Fusion
- In SASV systems (“ATMM-SAGA”), alternate between “CM-step” and “ASV-step” updates:
- For the "CM-step": unfreeze CM and fusion, freeze ASV, loss weight λ = 0.1.
- For the "ASV-step": unfreeze ASV and fusion, freeze CM, loss weight λ = 0.9.
- Losses are binary cross-entropy on each task, combined via a λ-weighted sum (Asali et al., 23 May 2025, Asali et al., 14 Feb 2026).
D. Alternated Supervised/Unsupervised Updates with Label Correction
- As in MMR for robust learning under noisy labels (Wang et al., 2024), alternate between:
- Minimizing a supervised loss (classification, plus reconstruction).
- Minimizing an unsupervised loss (deep clustering).
- Updating soft labels via a weighted average of classifier and clustering predictions.
Alternation supports robust label correction against false label injection.
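The label-correction step can be illustrated as follows. This is a minimal sketch: the blend weight `alpha` is a fixed illustrative value, whereas the paper derives the weighting from prediction confidence.

```python
import numpy as np

# Soft-label update as a confidence-weighted average of the supervised
# classifier's prediction and the unsupervised clustering assignment.
def correct_labels(p_classifier, p_cluster, alpha=0.7):
    """Blend two categorical distributions row-wise and renormalize."""
    soft = alpha * p_classifier + (1 - alpha) * p_cluster
    return soft / soft.sum(axis=1, keepdims=True)

# A possibly false hard label (class 0) vs. cluster evidence for class 1:
p_cls = np.array([[0.2, 0.8]])   # classifier already doubts the given label
p_clu = np.array([[0.1, 0.9]])   # clustering agrees with class 1
soft = correct_labels(p_cls, p_clu)
```

Training then proceeds against `soft` rather than the injected hard label, so that consistent classifier/clustering evidence gradually overrides false labels.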
E. Alternating Data-Source Updates in Segmentation
- In unpaired multi-modal segmentation (CT/MR), as in (Li et al., 2024), batches from CT and MR domains are interleaved at the iteration or minibatch level, allowing shared parameters to adapt sequentially to each modality.
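A data-source alternation schedule of this kind can be sketched as a simple generator. Function and tag names are illustrative, not from the cited implementation; the `random` mode corresponds to the stochastic-balancing variant mentioned in Section 7.

```python
import random

# Interleave batches from two unpaired domains (e.g. CT and MR) so shared
# parameters see each modality in alternation.
def alternating_batches(ct_batches, mr_batches, mode="per_iteration", seed=0):
    rng = random.Random(seed)
    for ct, mr in zip(ct_batches, mr_batches):
        if mode == "per_iteration":       # strict CT, MR, CT, MR, ...
            first_ct = True
        else:                             # "random": coin flip per pair
            first_ct = rng.random() < 0.5
        yield ("CT", ct) if first_ct else ("MR", mr)
        yield ("MR", mr) if first_ct else ("CT", ct)

tags = [tag for tag, _ in alternating_batches(["ct0", "ct1"], ["mr0", "mr1"])]
```

Each yielded pair carries a modality tag, which the training loop can use to select the modality-conditioned prior or loss.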
3. Mathematical Formulations and Update Dynamics
The general ATMM paradigm can be formalized as alternating minimization of an objective $F(\theta_1,\dots,\theta_M)$ decomposed over $M$ modules, with only one block of parameters updated per step. For blockwise alternation (Yan et al., 6 Aug 2025):

$$\theta_j^{(t+1)} = \theta_j^{(t)} - \eta_j^{(t)}\,\tilde{\nabla}_{\theta_j} F\big(\theta_1^{(t)},\dots,\theta_M^{(t)}\big), \qquad \theta_i^{(t+1)} = \theta_i^{(t)} \ \text{for } i \neq j,$$

with $\tilde{\nabla}_{\theta_j} F$ denoting the mini-batch stochastic gradient with respect to the $j$-th block, holding the other blocks fixed. Adaptive step sizes $\eta_j^{(t)}$ are learned via meta-learning mechanisms, summarized by an MLP acting on block-gradient statistics.
In multi-task settings (Bellavia et al., 2023), the update rules alternate as:

$$\theta_s^{(t+1)} = \theta_s^{(t)} - \eta\,\tilde{\nabla}_{\theta_s} L\big(\theta_s^{(t)}, \theta_b^{(t)}\big) \ \text{(shared epochs)}, \qquad \theta_b^{(t+1)} = \theta_b^{(t)} - \eta\,\tilde{\nabla}_{\theta_b} L\big(\theta_s^{(t)}, \theta_b^{(t)}\big) \ \text{(branch epochs)},$$

where $\theta_s$ and $\theta_b$ denote the shared and task-specific parameters, respectively.
Pseudocode for SASV-focused ATMM (Asali et al., 23 May 2025, Asali et al., 14 Feb 2026):
```
if p == 0:                       # CM focus
    lambda = 0.1
    Freeze(ASV branch)
    Unfreeze(CM branch & fusion)
    batch = sample(D_CM)
else:                            # ASV focus
    lambda = 0.9
    Freeze(CM branch)
    Unfreeze(ASV branch & fusion)
    batch = sample(D_ASV)
forward pass, losses, backward, optimizer step on unfrozen weights
```
This alternation is directly linked to the mitigation of gradient conflict in joint multi-task learning and to implicit regularization.
4. Theoretical Properties and Convergence Guarantees
Theoretical results for ATMM appear under various stochastic optimization models. For ATE-SG (Bellavia et al., 2023), under standard smoothness and unbiasedness assumptions, the iterates satisfy

$$\liminf_{t \to \infty} \big\| \nabla L(\theta^{(t)}) \big\| = 0 \quad \text{almost surely.}$$

The proof combines the standard descent lemma for the separate update steps with the Robbins–Siegmund supermartingale argument.
In blockwise SAMT (Yan et al., 6 Aug 2025), under strong convexity and gradient-stability assumptions, the iterates converge in expectation linearly in the noise-free case, and otherwise to a small residual error proportional to the gradient-noise variance:

$$\mathbb{E}\big[F(\theta^{(t)}) - F^\star\big] \le \rho^{t}\big(F(\theta^{(0)}) - F^\star\big) + \mathcal{O}(\sigma^2),$$

where $\rho \in (0,1)$ is a contraction factor determined by the step sizes and curvature, and $\sigma^2$ bounds the variance of the stochastic block gradients.
A plausible implication is that alternation preserves the convergence profile of SGD, but may enhance it in multi-module settings through reduced gradient conflict, especially when modules are initialized or pre-trained on non-overlapping domains.
5. Empirical Performance and Application Domains
Empirical studies consistently demonstrate that ATMM delivers superior regularization and improved generalization, especially in settings prone to gradient interference or label noise.
- Multi-task NN (ATE-SG): On synthetic and wireless signal data, ATE-SG yields smoother training curves, delays overfitting, and achieves higher test-set accuracy and F1 than standard SGD (Bellavia et al., 2023). Computational cost and memory usage are also reduced.
- SASV (ATMM-SAGA, ELEAT-SAGA): Incorporating ATMM with score-aware gated attention cuts SASV-EER from ~6.5% to ~2.2% on ASVspoof2019 Eval (min a-DCF: 0.0480) (Asali et al., 23 May 2025), and a refined EAT variant achieves as low as 1.22% EER (min a-DCF: 0.0303) (Asali et al., 14 Feb 2026).
- Robust clustering/classification (MMR): In the context of 30% symmetric false-label injection on a 4,300-sample TSA set, MMR with ATMM achieves 96.7% accuracy (naïve CNN/GRU baselines: 86–89%), while MMR-HIL raises accuracy to 98.7% in half the epochs (Wang et al., 2024).
- Segmentation (MulModSeg): Alternating unpaired CT/MR batch updates with modality-conditioned text priors improves Dice to 82.5 (CT) and 81.91 (MR), surpassing single-modality training by 1–3 points; adding text prior with ATMM achieves up to +6.48 Dice improvement (Li et al., 2024).
- Blockwise learning (SAMT): Blockwise ATMM with adaptive step sizes achieves >98% on MNIST and 1–2% higher test accuracy than SGD/Adam on CIFAR datasets, with improved stability in late training (Yan et al., 6 Aug 2025).
6. Robustness and Regularization Mechanisms
Alternation mitigates overfitting and increases robustness via several mechanisms documented in the literature:
- Gradient interference between tasks or domains is reduced, enabling specialized module adaptation without destructive interference in shared representations (Bellavia et al., 2023, Asali et al., 23 May 2025, Asali et al., 14 Feb 2026).
- Robustness to label noise is enhanced: the unsupervised module (deep clustering) can correct representations distorted by false labels, and alternation between supervised and unsupervised updates enables >95% effective label correction (Wang et al., 2024).
- In multimodal settings (e.g., CT/MR), alternation avoids instability caused by joint-domain batches and allows batchnorm statistics and shared heads to adapt to each domain sequentially (Li et al., 2024).
- Empirically, alternation acts as an implicit regularizer, with training/validation loss curves exhibiting delayed divergence (i.e., regularization is enhanced compared to joint-SGD).
7. Implementation Guidance and Practical Considerations
- Epoch/phase scheduling: In ATE-SG, short alternation cycles are most robust; longer phases reintroduce loss oscillations (Bellavia et al., 2023).
- Freezing/unfreezing: Standard frameworks (PyTorch, Keras) support toggling trainable modules via the `requires_grad` attribute or similar; careful parameter management is required when modules share batchnorm or running statistics.
- Learning rates: Use standard SGD/Adam or more advanced meta-learned schedules as in SAMT; step-size meta-learning can be implemented via additional lightweight MLPs (Yan et al., 6 Aug 2025).
- Label correction: For ATMM with unsupervised module alternation, post-epoch consensus label restoration is key; soft label updates should be weighted by confidence parameters as described (Wang et al., 2024).
- Data scheduling: In multi-source data settings, batches/iterations may alternate per-iteration, per-epoch, or via random binary draws for stochastic balancing (Li et al., 2024, Asali et al., 23 May 2025).
- Early stopping and diagnostics: Monitor phase-wise gradients and losses separately; lack of progress in either phase may indicate the need for learning rate adaptation or phase length tuning.
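The phase-wise monitoring suggested above can be implemented with a small helper. This is a minimal sketch; the window size and tolerance are illustrative defaults, not values from the cited papers.

```python
from collections import defaultdict

# Track losses separately per alternation phase and flag a phase whose
# recent moving average has stopped improving.
class PhaseMonitor:
    def __init__(self, window=5, tol=1e-3):
        self.hist = defaultdict(list)
        self.window, self.tol = window, tol

    def log(self, phase, loss):
        self.hist[phase].append(loss)

    def stalled(self, phase):
        h = self.hist[phase]
        if len(h) < 2 * self.window:
            return False                  # not enough history yet
        prev = sum(h[-2 * self.window:-self.window]) / self.window
        recent = sum(h[-self.window:]) / self.window
        return prev - recent < self.tol   # no meaningful progress

mon = PhaseMonitor()
for i in range(10):
    mon.log("shared", 1.0)                # flat: shared phase is stuck
    mon.log("branch", 1.0 / (i + 1))      # branch phase still improving
```

A flagged phase (`mon.stalled("shared")` here) is a candidate for learning-rate adaptation or a longer/shorter phase length, per the guidance above.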
Summary Table: Key ATMM Variants and Application Domains
| Variant | Modules/Blocks | Application Domain | Reference |
|---|---|---|---|
| ATE-SG | Shared vs task-specific | Multi-task neural nets | (Bellavia et al., 2023) |
| SAMT | Layers/blocks | Deep networks (MLP, CNN) | (Yan et al., 6 Aug 2025) |
| ATMM-SAGA/ELEAT-SAGA | ASV/CM branches + fusion | Spoof-robust speaker verification | (Asali et al., 23 May 2025, Asali et al., 14 Feb 2026) |
| MMR | Classifier vs clustering | Noisy-label time-series (TSA) | (Wang et al., 2024) |
| MulModSeg-ALT | CT/MR batch alternation | Unpaired multi-modal segmentation | (Li et al., 2024) |
ATMM thus constitutes a versatile paradigm for managing modular neural architectures in the presence of multi-task, multi-modal, noisy, or otherwise conflicting learning objectives, delivering provable and empirically validated gains in robustness, computational efficiency, and generalization.