
Minimum Latency Training (MLT)

Updated 16 March 2026
  • Minimum Latency Training (MLT) is a framework that minimizes delay by directly targeting time-to-decision, processing, or communication latency across various tasks.
  • MLT leverages techniques such as model splitting in federated learning, alignment masking in streaming ASR, and iterative step reduction in spiking neural networks to optimize performance.
  • Empirical studies show that MLT can significantly cut latency—with reductions up to 58% in federated contexts and notable gains in energy efficiency for SNNs—while preserving model accuracy.

Minimum Latency Training (MLT) refers to a class of machine learning algorithms, architectures, and training strategies that explicitly minimize processing, communication, or inference delay, measured as physical time, processing steps, frames, or communication rounds, within a given task. The goal of MLT is to optimize time-to-decision or time-to-consensus while maintaining, and sometimes even improving, model accuracy and system efficiency relative to conventional approaches.

1. Definitions and Problem Formulations

MLT encompasses a spectrum of problem domains, each characterized by bespoke latency definitions:

  • Federated/Split Learning: In federated edge learning with split models, latency is defined as the maximum local training round time across all clients. The Minimum Latency Training problem is formulated as the minimax objective over client completion times, subject to model-splitting policies and server resource constraints (Wen et al., 2023).
  • Streaming Sequence Transduction (e.g., ASR): For streaming sequence-to-sequence models using monotonic attention (MoChA, CA), latency is the offset between the emitted output token index and the ground-truth acoustic boundary. MLT is realized via direct penalization (differentiable expectation over delays or hard alignment masking) over the token emission process (Li et al., 2023, Inaguma et al., 2020, Shinohara et al., 2022).
  • Spiking Neural Networks (SNNs): Latency is intrinsically the number of simulation steps required to reach a stable output. MLT seeks to enable SNNs to produce correct outputs in the minimum number of steps—ideally a single step—by enhancing integration efficiency within each step (Chowdhury et al., 2021, Yao et al., 2024).

2. MLT Methodologies Across Domains

2.1 Federated Edge Learning with Model Splitting

In the SFL (Split Federated Learning) framework (Wen et al., 2023), each client $i$ splits the global DNN model at a chosen cut layer $\kappa_i$. The parameter server (PS) allocates a computational budget $f_i$ to each client for training the server-side partition. The per-client round time is

$$T_i(\kappa_i, f_i) = 2\,\frac{\lvert w_i^{\rm C}(\kappa_i)\rvert}{r_i} + I_i\,\lvert\mathcal{B}_i\rvert\left[ \frac{F_i^{\rm C}(\kappa_i)+B_i^{\rm C}(\kappa_i)}{f_i^{\rm C}} + \frac{F_i^{\rm S}(\kappa_i)+B_i^{\rm S}(\kappa_i)}{f_i} + 2\,\frac{\Lambda_i(\kappa_i)}{r_i} \right]$$

The minimax TLMP (training latency minimization problem) is subject to discrete cut-layer selection, server FLOPS constraints, and model consistency. To permit tractable optimization, regression-based surrogates for $\lvert w_i^{\rm C}(\kappa)\rvert$, $F_i^{\rm tot}(\kappa)$, and $\Lambda_i(\kappa)$ are fitted, and the resulting continuous relaxation is solved via alternating optimization over $\{\kappa_i\}$ and $\{f_i\}$ until convergence.
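The round-time expression above can be transcribed directly. A minimal sketch follows; the argument names and the mapping of symbols to arguments are illustrative inferences from the formula's structure, not the paper's code:

```python
def sfl_round_time(w_c, r, I, batch, F_c, B_c, f_c, F_s, B_s, f_s, act):
    """Per-client SFL round time T_i(kappa_i, f_i), transcribed from the formula.
    Units: w_c and act in bits, r in bit/s, F_*/B_* in FLOPs, f_* in FLOP/s."""
    model_exchange = 2.0 * w_c / r        # up/down transfer of the client-side weights
    client_compute = (F_c + B_c) / f_c    # forward + backward on the client partition
    server_compute = (F_s + B_s) / f_s    # forward + backward on the server partition
    act_exchange = 2.0 * act / r          # activation / gradient exchange at the cut layer
    return model_exchange + I * batch * (client_compute + server_compute + act_exchange)
```

Because the bracketed term is multiplied by the iteration and batch counts, the cut-layer choice trades client compute and activation-exchange time against server compute per sample.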

2.2 Streaming Speech Recognition (ASR) and Sequence Transduction

Monotonic Attention (MoChA/CA) and Sequence Transducer MLT:

Latency is measured as $\Delta_i = \hat{b}_i - b_i$, where $\hat{b}_i$ is the model's emission time for output token $i$ and $b_i$ the reference alignment boundary. Approaches:

  • Expected-Latency Regularization: Augment the cross-entropy loss with a term

$$L_{\rm MinLT} = \frac{1}{L}\sum_{i=1}^{L} \left|\, \mathbb{E}_\alpha[j] - b_i \,\right|$$

with the expectation taken over the model’s marginal emission probabilities (Inaguma et al., 2020, Shinohara et al., 2022).

  • Alignment Masking: For both conventional and self-regularised MLT, the attention distribution $\alpha_{i,j}$ is zeroed beyond a boundary $b_i+\delta$, obtained either statically from external alignments or adaptively from the model's own history, ensuring that emissions cannot be delayed beyond a tunable window (Li et al., 2023, Inaguma et al., 2020).
  • Differentiable Delay Penalty in Sequence Transducers: Expected delay is summed across lattice diagonals and incorporated via a trade-off parameter $\lambda$; this modifies the transducer gradient with terms that reward or penalize local emission moves according to their delay with respect to the reference path (Shinohara et al., 2022).
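The expected-latency regularizer reduces to a few lines once each output token's marginal emission distribution over input frames is available. A NumPy sketch, assuming `alpha` holds those per-token distributions (illustrative, not the papers' implementation):

```python
import numpy as np

def expected_latency_loss(alpha, b):
    """L_MinLT = (1/L) * sum_i |E_alpha[j] - b_i|, where row i of alpha is the
    marginal emission distribution of output token i over input frames (rows sum
    to 1) and b[i] is the reference boundary frame index. Sketch only."""
    L, T = alpha.shape
    frames = np.arange(T)
    expected_j = alpha @ frames        # E_alpha[j] for each output token
    return float(np.mean(np.abs(expected_j - b)))
```

Because the expectation is a differentiable function of the attention weights, this term can be added to the cross-entropy loss with a trade-off weight and trained end to end.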

2.3 Minimum-Step Spiking Neural Network Training

  • Iterative Initialization and Retraining (IIR-SNN): Starting with a multi-step ($T=5$) SNN trained by surrogate-gradient BPTT, the model is iteratively fine-tuned for successively lower step counts ($T=4,3,2,1$), each time initializing from the converged parameters of the previous $T$. This curriculum mitigates spike vanishing and enables single-shot ($T=1$) inference with minimal accuracy loss (Chowdhury et al., 2021).
  • One-Step SNNs with Feature Fusion: Minimum-latency SNNs for both convolutional and recurrent architectures can be constructed by partitioning feature maps into spatial "windows," computing current and recurrent stimuli, and fusing them via a bounded nonlinear projection function $\Omega$ at each step (Yao et al., 2024). With this design, full spatio-temporal integration is achievable in a single time step.
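To see why the step count is the latency, consider a toy integrate-and-fire layer: with too few steps the membrane potential never crosses threshold and the output vanishes, which is exactly the under-firing that the curriculum (or richer per-step integration) must compensate for. An illustrative sketch, not the IIR-SNN or one-step training code:

```python
import numpy as np

def if_layer(x, w, threshold=1.0, T=1):
    """Run a toy integrate-and-fire layer for T simulation steps and return
    per-neuron spike counts. Inference latency here is literally T."""
    v = np.zeros(w.shape[1])            # membrane potentials
    counts = np.zeros(w.shape[1])
    for _ in range(T):
        v = v + x @ w                   # integrate input current
        fired = v >= threshold
        counts += fired
        v[fired] = 0.0                  # hard reset after a spike
    return counts
```

With an input current of 0.6 per step and a threshold of 1.0, one step produces no spikes while two steps produce one, so naively truncating $T$ silences the network unless training adjusts weights and thresholds accordingly.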

3. Algorithms and Optimization Strategies

  • Alternating Optimization (SFL): The TLMP in SFL decomposes thanks to weak coupling: with $\{f_i\}$ fixed, minimization over $\kappa_i$ is convex per client; with $\{\kappa_i\}$ fixed, $\{f_i\}$ is found via constrained resource allocation that pushes bottleneck clients to a shared latency (Wen et al., 2023).
  • Self-Regularisation for Alignment Boundaries: In self-regularised MLT for streaming ASR, boundaries $b_i$ are periodically updated on a mini-batch basis if accuracy is non-decreasing and coverage improves, yielding latency reduction with guaranteed accuracy preservation (Li et al., 2023).
  • Delay-Constrained Alignment Path Pruning (DeCoT): In streaming S2S, backward recursions over attention alignments are restricted to paths within $\delta$ frames of $b_i$, with an auxiliary quantity loss to avoid degenerate attention distributions (Inaguma et al., 2020).
  • Gradient Rescaling by Expected Delay: For transducers, the loss gradient at each lattice cell $(t,u)$ is rescaled by the local excess or deficit in delay relative to the expected diagonal latency, providing fine-grained control over label emission timing (Shinohara et al., 2022).
  • Temporal Curriculum for SNNs: Sequential time-axis network compression (from higher $T$ down to $T=1$) with inherited state and thresholds overcomes the under-firing problem encountered by direct $T=1$ minimization (Chowdhury et al., 2021).
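The resource-allocation step can be sketched with a simple bisection. Model each client's round time as $T_i(f_i) = a_i + c_i/f_i$ (a fixed latency $a_i$ plus server-side work $c_i$ processed at rate $f_i$) under a total budget $\sum_i f_i \le F$; at the optimum every client sits at the same bottleneck latency $\tau$. This model and its variable names are a deliberate simplification of the paper's formulation:

```python
def minimax_allocation(a, c, F, iters=80):
    """Minimize max_i (a_i + c_i / f_i) subject to sum(f_i) <= F by bisecting on
    the shared latency tau: f_i(tau) = c_i / (tau - a_i) is the smallest budget
    that lets client i finish by tau. Simplified sketch of the SFL subproblem."""
    lo, hi = max(a), max(a) + sum(c) / F   # hi is always feasible
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        need = sum(ci / (tau - ai) for ai, ci in zip(a, c))
        if need > F:
            lo = tau                        # budget exceeded: allow more latency
        else:
            hi = tau                        # feasible: try a tighter deadline
    tau = hi
    return tau, [ci / (tau - ai) for ai, ci in zip(a, c)]
```

By construction the returned allocation equalizes all $T_i(f_i)$ at $\tau$: slow clients receive more budget, which is the "bottleneck clients pushed to shared latency" behavior described above.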

4. Empirical Results and Quantitative Analysis

Federated Edge Learning

| Method | Per-Round Latency | Test Accuracy |
|---|---|---|
| FedAvg | 980 s | ~90% |
| SFL-MLT | 410 s (~58% reduction) | ~90% |

Regression fits for surrogate model statistics achieved $R^2 = 0.95$ for client-side model size and $R^2 = 0.97$ for total client-side FLOPs (Wen et al., 2023).

Streaming Recognition

| Dataset | MLT Approach | Latency Reduction | Accuracy Impact |
|---|---|---|---|
| AIShell-1 | SR-MLT + MoChA | 39.5% | No loss (CER 6.3) |
| AIShell-1 | SR-MLT + CA | 11.8% | No loss (CER 6.4) |
| LibriSpeech | SR-MLT + CA | 26.1% (test) | Negligible |
| Cortana (MoChA) | DeCoT, $\delta = 24$ | >40% | −11% rel. WER |
| WSJ (Conformer-T) | MLT, $\lambda = 0.03$ | 220 ms → 27 ms | +0.7 pp WER |

Conventional MLT with external alignments can yield tradeoffs, but SR-MLT achieves lower latency at iso-accuracy (Li et al., 2023).

Spiking Neural Networks

| Dataset | $T=1$ SNN Accuracy | Energy Efficiency vs. ANN | Energy Efficiency vs. Multi-Step SNN |
|---|---|---|---|
| CIFAR-10 | 93.05% (Chowdhury et al., 2021) / 93.07% (Yao et al., 2024) | 33× | 3-10× |
| CIFAR-100 | 70.15% / 72.41% | 29× | — |
| ImageNet | 67.71% | 24.6× | — |

MLT-SNNs run at unit step latency, using up to 2500× fewer steps and achieving energy savings by switching to addition-dominated computation and reducing memory access (Chowdhury et al., 2021, Yao et al., 2024).

5. Practical Implementations and Trade-Offs

  • Hyperparameter Tuning: All domains require tuning of surrogate loss weights (e.g., $\lambda$ for the latency penalty) and operational parameters (e.g., cut-layer ranges, batch sizes, masking offsets $\delta$).
  • Computational Overhead: Most MLT algorithms leverage efficient backward/forward recursions, auxiliary statistics, or iteration policies, minimizing extra computational burden relative to standard pipelines.
  • Accuracy-Latency Trade-Off: While MLT generally maintains or modestly sacrifices accuracy, overly aggressive latency constraints can degrade recognition or inference performance, necessitating careful regularization and curriculum design.
  • Resource Efficiency: In edge and SNN contexts, MLT directly enables operation at reduced energy, bandwidth, and memory-load levels, thus facilitating deployment on constrained platforms.

6. Significance, Insights, and Recommendations

  • Generalizability: MLT frameworks are adaptable to a variety of architectures, including RNN-T, Conformer-T, Transformer-based ASR, SFL federated learners, and SNNs (Wen et al., 2023, Li et al., 2023, Inaguma et al., 2020, Shinohara et al., 2022, Chowdhury et al., 2021, Yao et al., 2024).
  • Self-Regularisation: Self-regularised alignment and boundary-update mechanisms, when correctly applied, reliably balance latency reduction and accuracy retention without external supervision (Li et al., 2023).
  • Curriculum Approaches: Progressive reduction in SNN simulation steps prevents spike vanishing and unlocks ultra-low latency operation with near-optimal accuracy (Chowdhury et al., 2021).
  • Model-Splitting and Resource Allocation: In federated contexts, joint optimization of model partition points and server resource allocation yields significant latency gains under realistic network and device constraints (Wen et al., 2023).
  • Implementation Practicality: MLT algorithms are directly compatible with existing frameworks (e.g., PyTorch for S2S and SNNs) and require minimal custom components, mostly in loss, attention masking, or curriculum scheduling modules.

7. Limitations and Prospects

MLT is currently bounded by the accuracy-latency Pareto frontier—excessive pressure to minimize delay may induce performance degradation if not adequately regularized. The optimal choice of partitioning schemes for model-splitting, window partition hyperparameters (in SNNs), and thresholds for masking/penalties remains application-dependent. A plausible implication is that future work will focus on adaptive, data-driven tuning of these meta-parameters, new forms of self-regularization and curriculum, and the integration of MLT into broader multi-objective optimization frameworks.

References:

  • "Training Latency Minimization for Model-Splitting Allowed Federated Edge Learning" (Wen et al., 2023)
  • "Self-regularised Minimum Latency Training for Streaming Transformer-based Speech Recognition" (Li et al., 2023)
  • "Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR" (Inaguma et al., 2020)
  • "One Timestep is All You Need: Training Spiking Neural Networks with Ultra Low Latency" (Chowdhury et al., 2021)
  • "Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition" (Shinohara et al., 2022)
  • "Training a General Spiking Neural Network with Improved Efficiency and Minimum Latency" (Yao et al., 2024)
