
Minimum Latency Training (MLT)

Updated 16 March 2026
  • Minimum Latency Training (MLT) is a framework that minimizes delay by directly targeting time-to-decision, processing, or communication latency across various tasks.
  • MLT leverages techniques such as model splitting in federated learning, alignment masking in streaming ASR, and iterative step reduction in spiking neural networks to optimize performance.
  • Empirical studies show that MLT can significantly cut latency—with reductions up to 58% in federated contexts and notable gains in energy efficiency for SNNs—while preserving model accuracy.

Minimum Latency Training (MLT) refers to a class of machine learning algorithms, architectures, and training strategies that explicitly minimize processing, communication, or inference delay, measured as physical time, processing steps, frames, or communication rounds, within a given task. The goal of MLT is to optimize time-to-decision or time-to-consensus while maintaining, and sometimes even improving, model accuracy and system efficiency relative to conventional approaches.

1. Definitions and Problem Formulations

MLT encompasses a spectrum of problem domains, each characterized by bespoke latency definitions:

  • Federated/Split Learning: In federated edge learning with split models, latency is defined as the maximum local training round time across all clients. The Minimum Latency Training problem is formulated as the minimax objective over client completion times, subject to model-splitting policies and server resource constraints (Wen et al., 2023).
  • Streaming Sequence Transduction (e.g., ASR): For streaming sequence-to-sequence models using monotonic attention (MoChA, CA), latency is the offset between the emitted output token index and the ground-truth acoustic boundary. MLT is realized via direct penalization (differentiable expectation over delays or hard alignment masking) over the token emission process (Li et al., 2023, Inaguma et al., 2020, Shinohara et al., 2022).
  • Spiking Neural Networks (SNNs): Latency is intrinsically the number of simulation steps required to reach a stable output. MLT seeks to enable SNNs to produce correct outputs in the minimum number of steps—ideally a single step—by enhancing integration efficiency within each step (Chowdhury et al., 2021, Yao et al., 2024).

2. MLT Methodologies Across Domains

2.1 Federated Edge Learning with Model Splitting

In the SFL (Split Federated Learning) framework (Wen et al., 2023), each client $i$ splits the global DNN model at a chosen cut layer $\kappa_i$. The parameter server (PS) allocates a computational budget $f_i$ to each client for training the server-side partition. The per-client round time is

$$T_i(\kappa_i, f_i) = 2\,\frac{\lvert w_i^{\rm C}(\kappa_i)\rvert}{r_i} + I_i\,\lvert\mathcal{B}_i\rvert\left[ \frac{F_i^{\rm C}(\kappa_i)+B_i^{\rm C}(\kappa_i)}{f_i^{\rm C}} + \frac{F_i^{\rm S}(\kappa_i)+B_i^{\rm S}(\kappa_i)}{f_i} + 2\,\frac{\Lambda_i(\kappa_i)}{r_i} \right]$$

The minimax TLMP (training latency minimization problem) is subject to discrete cut-layer selection, server FLOPS constraints, and model consistency. To permit tractable optimization, regression-based surrogates for $\lvert w_i^{\rm C}(\kappa)\rvert$, $F_i^{\rm tot}(\kappa)$, and $\Lambda_i(\kappa)$ are fitted, and the resulting continuous relaxation is solved via alternating optimization over $\{\kappa_i\}$ and $\{f_i\}$ until convergence.
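The round-time expression above can be transcribed directly. A minimal sketch follows; the argument names and the mapping of symbols to arguments are illustrative inferences from the formula's structure, not the paper's code:

```python
def sfl_round_time(w_c, r, I, batch, F_c, B_c, f_c, F_s, B_s, f_s, act):
    """Per-client SFL round time T_i(kappa_i, f_i), transcribed from the formula.
    Units: w_c and act in bits, r in bit/s, F_*/B_* in FLOPs, f_* in FLOP/s."""
    model_exchange = 2.0 * w_c / r        # up/down transfer of the client-side weights
    client_compute = (F_c + B_c) / f_c    # forward + backward on the client partition
    server_compute = (F_s + B_s) / f_s    # forward + backward on the server partition
    act_exchange = 2.0 * act / r          # activation / gradient exchange at the cut layer
    return model_exchange + I * batch * (client_compute + server_compute + act_exchange)
```

Because the bracketed term is multiplied by the iteration and batch counts, the cut-layer choice trades client compute and activation-exchange time against server compute per sample.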

2.2 Streaming Speech Recognition (ASR) and Sequence Transduction

Monotonic Attention (MoChA/CA) and Sequence Transducer MLT:

Latency is measured as $\Delta_i = \hat{b}_i - b_i$, where $\hat{b}_i$ is the model's emission time for output token $i$ and $b_i$ the reference alignment boundary. Approaches:

  • Expected-Latency Regularization: Augment the cross-entropy loss with a term

$$L_{\rm MinLT} = \frac{1}{L}\sum_{i=1}^{L} \left|\, \mathbb{E}_\alpha[j] - b_i \,\right|$$

with the expectation taken over the model’s marginal emission probabilities (Inaguma et al., 2020, Shinohara et al., 2022).

  • Alignment Masking: For both conventional and self-regularised MLT, the attention distribution $\alpha_{i,j}$ is zeroed beyond a boundary $b_i+\delta$, obtained either statically from external alignments or adaptively from the model's own history, ensuring that emissions cannot be delayed beyond a tunable window (Li et al., 2023, Inaguma et al., 2020).
  • Differentiable Delay Penalty in Sequence Transducers: Expected delay is summed across lattice diagonals and incorporated via a trade-off parameter $\lambda$; this modifies the transducer gradient with terms that reward or penalize local emission moves according to their delay with respect to the reference path (Shinohara et al., 2022).
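The expected-latency regularizer reduces to a few lines once each output token's marginal emission distribution over input frames is available. A NumPy sketch, assuming `alpha` holds those per-token distributions (illustrative, not the papers' implementation):

```python
import numpy as np

def expected_latency_loss(alpha, b):
    """L_MinLT = (1/L) * sum_i |E_alpha[j] - b_i|, where row i of alpha is the
    marginal emission distribution of output token i over input frames (rows sum
    to 1) and b[i] is the reference boundary frame index. Sketch only."""
    L, T = alpha.shape
    frames = np.arange(T)
    expected_j = alpha @ frames        # E_alpha[j] for each output token
    return float(np.mean(np.abs(expected_j - b)))
```

Because the expectation is a differentiable function of the attention weights, this term can be added to the cross-entropy loss with a trade-off weight and trained end to end.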

2.3 Minimum-Step Spiking Neural Network Training

  • Iterative Initialization and Retraining (IIR-SNN): Starting with a multi-step ($T=5$) SNN trained by surrogate-gradient BPTT, the model is iteratively fine-tuned for successively lower step counts ($T=4,3,2,1$), each time initializing from the converged parameters of the previous $T$. This curriculum mitigates spike vanishing and enables single-shot ($T=1$) inference with minimal accuracy loss (Chowdhury et al., 2021).
  • One-Step SNNs with Feature Fusion: Minimum-latency SNNs for both convolutional and recurrent architectures can be constructed by partitioning feature maps into spatial "windows," computing current and recurrent stimuli, and fusing them via a bounded nonlinear projection function $\Omega$ at each step (Yao et al., 2024). With this design, full spatio-temporal integration is achievable in a single time step.
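To see why the step count is the latency, consider a toy integrate-and-fire layer: with too few steps the membrane potential never crosses threshold and the output vanishes, which is exactly the under-firing that the curriculum (or richer per-step integration) must compensate for. An illustrative sketch, not the IIR-SNN or one-step training code:

```python
import numpy as np

def if_layer(x, w, threshold=1.0, T=1):
    """Run a toy integrate-and-fire layer for T simulation steps and return
    per-neuron spike counts. Inference latency here is literally T."""
    v = np.zeros(w.shape[1])            # membrane potentials
    counts = np.zeros(w.shape[1])
    for _ in range(T):
        v = v + x @ w                   # integrate input current
        fired = v >= threshold
        counts += fired
        v[fired] = 0.0                  # hard reset after a spike
    return counts
```

With an input current of 0.6 per step and a threshold of 1.0, one step produces no spikes while two steps produce one, so naively truncating $T$ silences the network unless training adjusts weights and thresholds accordingly.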

3. Algorithms and Optimization Strategies

  • Alternating Optimization (SFL): The TLMP in SFL decomposes thanks to weak coupling: with $\{f_i\}$ fixed, minimization over $\kappa_i$ is convex per client; with $\{\kappa_i\}$ fixed, $\{f_i\}$ is found via constrained resource allocation that pushes bottleneck clients to a shared latency (Wen et al., 2023).
  • Self-Regularisation for Alignment Boundaries: In self-regularised MLT for streaming ASR, boundaries $b_i$ are periodically updated on a mini-batch basis if accuracy is non-decreasing and coverage improves, yielding latency reduction with guaranteed accuracy preservation (Li et al., 2023).
  • Delay-Constrained Alignment Path Pruning (DeCoT): In streaming S2S, backward recursions over attention alignments are restricted to paths within $\delta$ frames of $b_i$, with an auxiliary quantity loss to avoid degenerate attention distributions (Inaguma et al., 2020).
  • Gradient Rescaling by Expected Delay: For transducers, the loss gradient at each lattice cell $(t,u)$ is rescaled by the local excess or deficit in delay relative to the expected diagonal latency, providing fine-grained control over label emission timing (Shinohara et al., 2022).
  • Temporal Curriculum for SNNs: Sequential time-axis network compression (from higher $T$ down to $T=1$) with inherited state and thresholds overcomes the under-firing problem encountered by direct $T=1$ minimization (Chowdhury et al., 2021).
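The resource-allocation step can be sketched with a simple bisection. Model each client's round time as $T_i(f_i) = a_i + c_i/f_i$ (a fixed latency $a_i$ plus server-side work $c_i$ processed at rate $f_i$) under a total budget $\sum_i f_i \le F$; at the optimum every client sits at the same bottleneck latency $\tau$. This model and its variable names are a deliberate simplification of the paper's formulation:

```python
def minimax_allocation(a, c, F, iters=80):
    """Minimize max_i (a_i + c_i / f_i) subject to sum(f_i) <= F by bisecting on
    the shared latency tau: f_i(tau) = c_i / (tau - a_i) is the smallest budget
    that lets client i finish by tau. Simplified sketch of the SFL subproblem."""
    lo, hi = max(a), max(a) + sum(c) / F   # hi is always feasible
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        need = sum(ci / (tau - ai) for ai, ci in zip(a, c))
        if need > F:
            lo = tau                        # budget exceeded: allow more latency
        else:
            hi = tau                        # feasible: try a tighter deadline
    tau = hi
    return tau, [ci / (tau - ai) for ai, ci in zip(a, c)]
```

By construction the returned allocation equalizes all $T_i(f_i)$ at $\tau$: slow clients receive more budget, which is the "bottleneck clients pushed to shared latency" behavior described above.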

4. Empirical Results and Quantitative Analysis

Federated Edge Learning

| Method | Per-Round Latency | Test Accuracy |
|---|---|---|
| FedAvg | 980 s | ~90% |
| SFL-MLT | 410 s (~58% reduction) | ~90% |

Regression fits for surrogate model statistics achieved $R^2 = 0.95$ for client-side model size and $R^2 = 0.97$ for total client-side FLOPs (Wen et al., 2023).

Streaming Recognition

| Dataset | MLT Approach | Latency Reduction | Accuracy Impact |
|---|---|---|---|
| AIShell-1 | SR-MLT + MoChA | 39.5% | No loss (CER 6.3) |
| AIShell-1 | SR-MLT + CA | 11.8% | No loss (CER 6.4) |
| LibriSpeech | SR-MLT + CA | 26.1% (test) | Negligible |
| Cortana (MoChA) | DeCoT, $\delta = 24$ | >40% | −11% rel. WER |
| WSJ (Conformer-T) | MLT, $\lambda = 0.03$ | 220 ms → 27 ms | +0.7 pp WER |

Conventional MLT with external alignments can yield tradeoffs, but SR-MLT achieves lower latency at iso-accuracy (Li et al., 2023).

Spiking Neural Networks

| Dataset | $T=1$ SNN Accuracy | Energy Efficiency vs. ANN | Energy Efficiency vs. Multi-Step SNN |
|---|---|---|---|
| CIFAR-10 | 93.05% (Chowdhury et al., 2021) / 93.07% (Yao et al., 2024) | 33× | 3-10× |
| CIFAR-100 | 70.15% / 72.41% | 29× | — |
| ImageNet | 67.71% | 24.6× | — |

MLT-SNNs run at unit step latency, using up to 2500× fewer steps and achieving energy savings by switching to addition-dominated computation and reducing memory access (Chowdhury et al., 2021, Yao et al., 2024).

5. Practical Implementations and Trade-Offs

  • Hyperparameter Tuning: All domains require tuning of surrogate loss weights (e.g., $\lambda$ for the latency penalty) and operational parameters (e.g., cut-layer ranges, batch sizes, masking offsets $\delta$).
  • Computational Overhead: Most MLT algorithms leverage efficient backward/forward recursions, auxiliary statistics, or iteration policies, minimizing extra computational burden relative to standard pipelines.
  • Accuracy-Latency Trade-Off: While MLT generally maintains or modestly sacrifices accuracy, overly aggressive latency constraints can degrade recognition or inference performance, necessitating careful regularization and curriculum design.
  • Resource Efficiency: In edge and SNN contexts, MLT directly enables operation at reduced energy, bandwidth, and memory-load levels, thus facilitating deployment on constrained platforms.

6. Significance, Insights, and Recommendations

  • Generalizability: MLT frameworks are adaptable to a variety of architectures, including RNN-T, Conformer-T, Transformer-based ASR, SFL federated learners, and SNNs (Wen et al., 2023, Li et al., 2023, Inaguma et al., 2020, Shinohara et al., 2022, Chowdhury et al., 2021, Yao et al., 2024).
  • Self-Regularisation: Self-regularised alignment and boundary-update mechanisms, when correctly applied, reliably balance latency reduction and accuracy retention without external supervision (Li et al., 2023).
  • Curriculum Approaches: Progressive reduction in SNN simulation steps prevents spike vanishing and unlocks ultra-low latency operation with near-optimal accuracy (Chowdhury et al., 2021).
  • Model-Splitting and Resource Allocation: In federated contexts, joint optimization of model partition points and server resource allocation yields significant latency gains under realistic network and device constraints (Wen et al., 2023).
  • Implementation Practicality: MLT algorithms are directly compatible with existing frameworks (e.g., PyTorch for S2S and SNNs) and require minimal custom components, mostly in loss, attention masking, or curriculum scheduling modules.

7. Limitations and Prospects

MLT is currently bounded by the accuracy-latency Pareto frontier—excessive pressure to minimize delay may induce performance degradation if not adequately regularized. The optimal choice of partitioning schemes for model-splitting, window partition hyperparameters (in SNNs), and thresholds for masking/penalties remains application-dependent. A plausible implication is that future work will focus on adaptive, data-driven tuning of these meta-parameters, new forms of self-regularization and curriculum, and the integration of MLT into broader multi-objective optimization frameworks.

References:

  • "Training Latency Minimization for Model-Splitting Allowed Federated Edge Learning" (Wen et al., 2023)
  • "Self-regularised Minimum Latency Training for Streaming Transformer-based Speech Recognition" (Li et al., 2023)
  • "Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR" (Inaguma et al., 2020)
  • "One Timestep is All You Need: Training Spiking Neural Networks with Ultra Low Latency" (Chowdhury et al., 2021)
  • "Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition" (Shinohara et al., 2022)
  • "Training a General Spiking Neural Network with Improved Efficiency and Minimum Latency" (Yao et al., 2024)
