Latency-Aware Neural Networks
- Latency-Aware Neural Networks are designs that co-optimize accuracy and inference latency by embedding hardware-specific latency predictors into the neural architecture search and training process.
- Latency-aware NAS frameworks and surrogate models use differentiable losses, regression, and GNN predictors to achieve significant real-world speedups, such as a 32% latency reduction on CIFAR-10.
- Advanced techniques like pruning, compression, and dynamic network methods further optimize real-time performance by tailoring operations (e.g., SIMD-structured masks) to meet strict hardware latency constraints.
A latency-aware neural network is a neural architecture that is designed, trained, and/or selected with explicit consideration of its inference latency on a target hardware platform, in addition to conventional metrics such as accuracy. The objective is to co-optimize accuracy and latency so that the deployed model satisfies stringent real-time, edge, or system-level constraints. Latency awareness is achieved by integrating latency prediction, measurement, or surrogate models directly into the training or search procedure, enabling the selection or generation of architectures that meet predefined wall-clock requirements on specific devices (Xu et al., 2020, Srinivas et al., 2019, Eriksson et al., 2021).
1. Latency-Aware Neural Architecture Search (NAS)
Latency-aware NAS frameworks incorporate hardware-specific latency estimates into the architecture search alongside accuracy or task objectives. Classical methods such as DARTS or FBNet are extended with a differentiable (or smooth) surrogate for network latency, learned from measured data and used as an auxiliary loss term in the NAS optimization.
In "Latency-Aware Differentiable Neural Architecture Search" (Xu et al., 2020), a network encoding (e.g., a binary vector representing operations and connectivity) is mapped to an accurate scalar latency estimate via a small regressor trained on a dataset of sampled architectures with measured hardware latencies. The latency-augmented NAS objective is:
$$\mathcal{L}(\alpha, w) = \mathcal{L}_{\mathrm{CE}}(\alpha, w) + \lambda \cdot \mathrm{LAT}(\alpha),$$
where $\mathrm{LAT}(\alpha)$ is the expected predicted latency under the (softmax) distribution induced by the architectural weights $\alpha$, and $\lambda$ trades off accuracy against latency. This approach admits a fully differentiable, hardware-specific search that uncovers architectures with significant improvements in real measured inference latency (up to 32% over the baseline on CIFAR-10), with negligible accuracy loss.
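For concreteness, a minimal PyTorch sketch of this kind of latency-augmented objective is given below. The `LatencyRegressor`, the search-space dimensions, and the weighting coefficient are illustrative assumptions, and the regressor is presumed to have been pre-trained on measured (encoding, latency) pairs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EDGES, NUM_OPS = 14, 8  # assumed search-space dimensions

class LatencyRegressor(nn.Module):
    """Maps a (soft) one-hot architecture encoding to a scalar latency estimate (ms)."""
    def __init__(self, in_dim=NUM_EDGES * NUM_OPS, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, encoding):
        return self.net(encoding).squeeze(-1)

def latency_aware_loss(logits, targets, arch_alphas, latency_model, lam=0.1):
    """Cross-entropy plus lambda times the predicted latency of the soft architecture."""
    ce = F.cross_entropy(logits, targets)
    # Softmax over candidate ops per edge yields a relaxed (soft) encoding of the architecture.
    soft_encoding = F.softmax(arch_alphas, dim=-1).reshape(1, -1)
    predicted_latency = latency_model(soft_encoding).mean()
    return ce + lam * predicted_latency

# Usage sketch: alphas are the architecture parameters being searched.
alphas = torch.randn(NUM_EDGES, NUM_OPS, requires_grad=True)
latency_model = LatencyRegressor()  # assumed pre-trained on measured latencies
logits, targets = torch.randn(32, 10), torch.randint(0, 10, (32,))
loss = latency_aware_loss(logits, targets, alphas, latency_model)
loss.backward()  # gradients flow into alphas through the latency surrogate
```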
HANNA (Srinivas et al., 2019) uses a differentiable NAS method with a latency (and energy) lookup table built by profiling candidate blocks on the actual target device, such as a Raspberry Pi. The latency term, parametrized with per-layer Gumbel-Softmax masks, allows continuous relaxation and end-to-end optimization over a balanced accuracy-latency Pareto frontier. Substantial latency reductions (2.5× over MobileNetV2) were achieved with a controlled decrease in accuracy.
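A complementary sketch, assuming a profiled per-layer latency lookup table (the candidate blocks and millisecond values below are invented), shows how a Gumbel-Softmax relaxation turns such a table into a differentiable latency term:

```python
import torch
import torch.nn.functional as F

# Per-layer candidate blocks with profiled latencies (ms) on the target device.
latency_lut = torch.tensor([
    [0.8, 1.5, 2.9],   # layer 1: e.g. 3x3 dwconv, 5x5 dwconv, mbconv-e6
    [1.1, 2.0, 3.7],   # layer 2
    [1.4, 2.6, 4.4],   # layer 3
])

def expected_latency(arch_logits, lut, tau=1.0):
    """Differentiable expected latency: Gumbel-Softmax weights times LUT entries."""
    # One relaxed one-hot selection per layer; per-layer expectations are summed.
    weights = F.gumbel_softmax(arch_logits, tau=tau, hard=False, dim=-1)
    return (weights * lut).sum()

arch_logits = torch.zeros_like(latency_lut, requires_grad=True)
lat = expected_latency(arch_logits, latency_lut)
lat.backward()  # the latency term can be added to the task loss and optimized end to end
```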
Multi-objective Bayesian optimization is also employed in NAS settings (Eriksson et al., 2021), where Gaussian Processes model both accuracy and on-device latency, enabling direct recovery of the Pareto frontier through hypervolume improvement acquisition. This is particularly beneficial in high-dimensional architectural search spaces, managing trade-offs between accuracy and p99 latency with minimal evaluations.
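The Bayesian optimization machinery itself (GP surrogates, hypervolume-improvement acquisition) is not reproduced here; the small helper below only illustrates the Pareto-frontier bookkeeping over already-evaluated (accuracy, latency) pairs:

```python
def pareto_front(points):
    """points: list of (accuracy, latency_ms); higher accuracy and lower latency are better."""
    front = []
    for acc, lat in points:
        dominated = any(a >= acc and l <= lat and (a, l) != (acc, lat) for a, l in points)
        if not dominated:
            front.append((acc, lat))
    return sorted(front, key=lambda p: p[1])

evaluated = [(0.93, 12.0), (0.91, 8.5), (0.94, 20.0), (0.90, 9.0), (0.92, 8.0)]
print(pareto_front(evaluated))  # -> [(0.92, 8.0), (0.93, 12.0), (0.94, 20.0)]
```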
2. Latency Prediction and Surrogate Modeling
Central to latency-aware design is the development of predictive models that map neural architectures to hardware-specific latency with minimal sample requirements and high robustness across device families and frameworks. Several general strategies have emerged:
Regression-based Surrogates
Network architectures are encoded (as binary operation vectors, graphs, or structural features) and passed to a multi-layer perceptron regressor, as in LA-DNAS (Xu et al., 2020) or FBNet/HANNA (Srinivas et al., 2019). These surrogates are trained on large pools (e.g., 100k) of architectures with measured latencies, achieving relative errors below 5–10%. The models can generalize across devices by retraining on new device data, with each transfer requiring only linear training time in the number of sampled architectures.
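A hedged sketch of fitting such a surrogate is shown below; the encoding dimension, network width, synthetic data, and relative-error objective are illustrative choices rather than the exact recipe of the cited works.

```python
import torch
import torch.nn as nn

def fit_latency_surrogate(encodings, latencies_ms, epochs=200, lr=1e-3):
    """encodings: (N, D) float tensor of architecture encodings; latencies_ms: (N,) measurements."""
    model = nn.Sequential(nn.Linear(encodings.shape[1], 256), nn.ReLU(),
                          nn.Linear(256, 256), nn.ReLU(),
                          nn.Linear(256, 1))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        pred = model(encodings).squeeze(-1)
        # Relative (percentage-style) error is often more meaningful than MSE for latency.
        loss = ((pred - latencies_ms).abs() / latencies_ms).mean()
        loss.backward()
        opt.step()
    return model

# Synthetic stand-in for a pool of sampled architectures with measured latencies.
enc = torch.rand(1024, 112)
lat = 5.0 + 20.0 * enc.mean(dim=1) + 0.5 * torch.rand(1024)
surrogate = fit_latency_surrogate(enc, lat)
```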
Graph Neural Network Predictors
Recent work such as NASFLAT (Akhauri et al., 2024) employs two-stage graph neural networks (GNNs) to jointly encode computation graphs, operation types, and hardware identifiers. Supplementary global descriptors (e.g., Arch2Vec, zero-cost proxies) further improve generalization and few-shot adaptation. Operation-specific hardware embeddings, cross-device clustering for architecture sampling, and modular transfer learning via Spearman initialization of device embeddings underpin state-of-the-art few-shot latency prediction (22.5% average error reduction, up to 87.6% on hard splits), enabling full hardware-aware NAS loops with a roughly fivefold speedup in wall-clock time.
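The following self-contained sketch shows the general shape of a GNN latency predictor conditioned on a hardware embedding; the dense-adjacency message passing, layer sizes, and two propagation rounds are simplifying assumptions, not the NASFLAT design.

```python
import torch
import torch.nn as nn

class GNNLatencyPredictor(nn.Module):
    def __init__(self, num_op_types=16, num_devices=8, dim=64):
        super().__init__()
        self.op_embed = nn.Embedding(num_op_types, dim)
        self.hw_embed = nn.Embedding(num_devices, dim)
        self.msg1 = nn.Linear(dim, dim)
        self.msg2 = nn.Linear(dim, dim)
        self.readout = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, op_ids, adj, device_id):
        # op_ids: (num_nodes,) operation type per node; adj: (num_nodes, num_nodes) adjacency.
        h = self.op_embed(op_ids)
        h = torch.relu(self.msg1(adj @ h) + h)   # message-passing round 1 (residual)
        h = torch.relu(self.msg2(adj @ h) + h)   # message-passing round 2 (residual)
        graph_repr = h.mean(dim=0)               # mean-pool node features into a graph embedding
        hw = self.hw_embed(device_id).squeeze(0) # hardware embedding for the target device
        return self.readout(torch.cat([graph_repr, hw])).squeeze(-1)

model = GNNLatencyPredictor()
op_ids = torch.randint(0, 16, (7,))
adj = (torch.rand(7, 7) > 0.7).float()
pred_ms = model(op_ids, adj, torch.tensor([3]))
```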
Lookup Table and Layerwise Models
On resource-constrained devices (microcontrollers, embedded CPUs), profiling of candidate operations yields hardware-specific lookup tables for primitive computations (e.g., convolutions of varying shapes) (King et al., 2023). During architecture search, predicted total latency is obtained by summing over per-layer values, with dynamic convolutions enabling fine-grained parameterization.
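A minimal lookup-table estimator might look as follows, with invented per-primitive timings standing in for on-device profiling:

```python
# Per-primitive latencies profiled once on the target microcontroller (values invented).
OP_LATENCY_US = {
    ("conv2d", 3, 16, 32): 410.0,   # (op, kernel, channels, spatial) -> microseconds
    ("conv2d", 3, 32, 16): 380.0,
    ("dwconv", 3, 32, 16): 95.0,
    ("linear", 0, 64, 1): 12.0,
}

def predict_network_latency(layers):
    """layers: list of keys into the profiled lookup table; total latency is the per-layer sum."""
    return sum(OP_LATENCY_US[layer] for layer in layers)

candidate = [("conv2d", 3, 16, 32), ("dwconv", 3, 32, 16), ("linear", 0, 64, 1)]
print(f"predicted latency: {predict_network_latency(candidate):.0f} us")
```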
Prior Knowledge and Transfer
MAPLE-X (Abbasi et al., 2022) extends basic regression by leveraging two priors: (1) hardware-similarity-based importance weighting, and (2) DNN “neighborhood” virtual examples to generate target-specific training data from limited real measurements. MAPLE-Edge (Nair et al., 2022) uses normalized hardware-runtime descriptors (performance counters per-operator normalized by latencies) for fast adaptation to new runtimes with as few as 10 measurement points.
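As an illustration of the hardware-similarity idea (not the exact MAPLE-X formulation), the sketch below reweights source-device measurements by an assumed RBF similarity between hardware descriptors before fitting a weighted regressor:

```python
import numpy as np
from sklearn.linear_model import Ridge

def similarity_weights(source_descriptors, target_descriptor, gamma=1.0):
    # RBF similarity between per-sample hardware descriptors and the target device descriptor.
    d2 = ((source_descriptors - target_descriptor) ** 2).sum(axis=1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.random((500, 32))                    # architecture features from source devices
y = 3.0 + 10.0 * X[:, :4].sum(axis=1)        # their measured latencies (synthetic)
src_desc = rng.random((500, 8))              # per-sample hardware descriptors
tgt_desc = rng.random(8)                     # descriptor of the new target device

w = similarity_weights(src_desc, tgt_desc)
model = Ridge(alpha=1.0).fit(X, y, sample_weight=w)  # fit biased toward similar hardware
```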
Operation-wise and Analytic Models
Explicitly modeling the latency of each operation type as a function of layer features, followed by summation over the computation graph, provides a highly accurate and sample-efficient approach to latency estimation on mobile, edge, and GPU devices (Li et al., 2022). Gradient-boosted trees (GBDTs), random forests, or even lasso-linear models with hand-selected features achieve 2–10% mean absolute percentage error, reliably supporting NAS and pruning pipelines.
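A sketch of the operation-wise approach, fitting one gradient-boosted regressor per operation type on synthetic per-op features and summing predictions over a graph, is shown below; the feature layout and data are placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)

def make_op_dataset(n=2000):
    # Features: [in_channels, out_channels, kernel, spatial, stride]; latency target in ms (synthetic).
    X = rng.integers(1, 256, size=(n, 5)).astype(float)
    y = 1e-4 * X[:, 0] * X[:, 1] * X[:, 2] ** 2 + rng.normal(0, 0.1, n)
    return X, y

# One latency model per operation type.
op_models = {}
for op in ("conv2d", "dwconv", "linear"):
    X, y = make_op_dataset()
    op_models[op] = GradientBoostingRegressor(n_estimators=200).fit(X, y)

def predict_graph_latency(ops):
    """ops: list of (op_type, feature_vector); returns the sum of per-op predictions."""
    return sum(float(op_models[t].predict(np.asarray(f, dtype=float).reshape(1, -1))[0])
               for t, f in ops)

graph = [("conv2d", [3, 32, 3, 224, 2]), ("dwconv", [32, 32, 3, 112, 1]), ("linear", [1280, 1000, 1, 1, 1])]
print(predict_graph_latency(graph))
```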
3. Latency-Aware Pruning, Compression, and Dynamic Networks
Pruning and compression methods are directly configured with hardware-specific latency objectives, moving beyond simple FLOPs reduction to enforce real wall-clock speedup.
Methods such as Archtree (Reboul et al., 2023) implement beam search over tree-structured pruning candidates, using on-the-fly latency measurements to closely match predefined latency budgets. The algorithm alternates pruning steps with local fine-tuning and maintains parallel candidate submodels, selecting those with minimal accuracy degradation while meeting latency constraints.
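A schematic version of such a latency-budgeted beam search is sketched below; `measure_latency`, `prune_step`, and `evaluate` are placeholder callables standing in for on-device measurement, candidate generation, and local fine-tuning plus validation.

```python
import random

def beam_search_prune(candidate0, latency_budget_ms, measure_latency, prune_step, evaluate,
                      beam=4, steps=20):
    beam_set = [candidate0]
    for _ in range(steps):
        # Expand each beam candidate with several alternative pruning moves.
        expanded = [prune_step(c) for c in beam_set for _ in range(beam)]
        expanded.sort(key=evaluate, reverse=True)          # most accurate first
        feasible = [c for c in expanded if measure_latency(c) <= latency_budget_ms]
        if feasible:
            return max(feasible, key=evaluate)             # best accuracy within budget
        beam_set = expanded[:beam]                         # keep the best partial candidates
    return min(beam_set, key=measure_latency)

# Toy stand-ins: a "model" is just (latency_ms, accuracy); each pruning step trades both down.
prune_step = lambda m: (m[0] * random.uniform(0.85, 0.95), m[1] - random.uniform(0.001, 0.01))
result = beam_search_prune((30.0, 0.76), 15.0, lambda m: m[0], prune_step, lambda m: m[1])
print(result)
```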
SIMD-structured pruning (Zhao et al., 2021) matches pruning granularity to the hardware's vector unit width, employing group-level ℓ0 regularization and piecewise-linear latency models, solved via ADMM optimization. Empirical results show new Pareto frontiers, especially on mobile CPUs, with ≈2× speedups at equal or slightly better accuracy compared to channel-pruning baselines.
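The sketch below shows only the SIMD-aligned masking structure (whole channel groups of the vector width zeroed by smallest group norm); the cited method instead learns such masks through group-level regularization and ADMM.

```python
import torch

def simd_group_mask(weight, simd_width=8, keep_ratio=0.5):
    """weight: (out_ch, in_ch, kh, kw); returns a {0,1} mask over SIMD-width channel groups."""
    out_ch = weight.shape[0]
    assert out_ch % simd_width == 0, "pad channels to a multiple of the SIMD width"
    groups = weight.reshape(out_ch // simd_width, simd_width, -1)
    group_norms = groups.flatten(1).norm(dim=1)            # one norm per SIMD-width group
    k = int(keep_ratio * group_norms.numel())
    keep = torch.zeros_like(group_norms)
    keep[group_norms.topk(k).indices] = 1.0
    return keep.repeat_interleave(simd_width)              # one mask entry per output channel

w = torch.randn(64, 32, 3, 3)
mask = simd_group_mask(w)                                  # (64,), zeros come in aligned blocks of 8
pruned = w * mask.view(-1, 1, 1, 1)
```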
Depth-compression methods (Kim et al., 2023) use dynamic programming to optimally merge contiguous convolutional layers and prune non-linearities, with block-level latency profiling on hardware (e.g., via TensorRT) guiding the optimization. These techniques yield significant speedups (1.4–1.8×) without sacrificing accuracy on networks such as MobileNetV2.
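A toy dynamic program in this spirit appears below; the fused-block latency model is an invented stand-in for block-level profiling on the target runtime.

```python
import functools

def optimal_partition(n_layers, n_segments, seg_latency):
    """Partition layers 0..n_layers-1 into exactly n_segments contiguous merged blocks,
    minimizing summed block latency. seg_latency(i, j) gives the latency of merging layers i..j."""
    @functools.lru_cache(maxsize=None)
    def best(start, segs_left):
        if segs_left == 1:
            return seg_latency(start, n_layers - 1), [(start, n_layers - 1)]
        options = []
        for end in range(start, n_layers - segs_left + 1):
            tail_cost, tail_plan = best(end + 1, segs_left - 1)
            options.append((seg_latency(start, end) + tail_cost, [(start, end)] + tail_plan))
        return min(options, key=lambda o: o[0])
    return best(0, n_segments)

# Invented per-layer latencies; the fused-block model assumes merging amortizes some overhead.
per_layer = [2.0, 1.5, 3.0, 2.5, 1.0, 2.0]
seg_latency = lambda i, j: 0.7 * sum(per_layer[i:j + 1]) + 0.6
cost, plan = optimal_partition(len(per_layer), 3, seg_latency)
print(cost, plan)   # minimal total latency and the chosen (start, end) merged blocks
```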
Dynamic computation frameworks (e.g., LAUDNet (Han et al., 2023)) explicitly manage execution of spatial, channel, or layer-wise dynamic paths using mask generators, with learned binary gating at coarse or fine granularity. These are combined with analytic/hybrid latency predictors and scheduling optimizations to bridge the gap between theoretical and practical speedup, yielding over 50% reduction in measured latency on modern GPUs at minimal accuracy cost.
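A minimal sketch of coarse spatial gating with a straight-through estimator follows; note that the masked multiplication by itself does not reduce wall-clock time, and realizing the speedup requires the sparse kernels and scheduling optimizations discussed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseSpatialGate(nn.Module):
    def __init__(self, channels, patch=16):
        super().__init__()
        self.patch = patch
        self.gate = nn.Conv2d(channels, 1, kernel_size=1)       # cheap mask generator
        self.heavy = nn.Conv2d(channels, channels, 3, padding=1)  # expensive branch

    def forward(self, x):
        # One logit per large spatial patch: hard decision in the forward pass,
        # soft gradient in the backward pass (straight-through estimator).
        logits = F.adaptive_avg_pool2d(self.gate(x),
                                       (x.shape[2] // self.patch, x.shape[3] // self.patch))
        soft = torch.sigmoid(logits)
        hard = (soft > 0.5).float()
        mask = hard + soft - soft.detach()
        mask = F.interpolate(mask, size=x.shape[2:], mode="nearest")
        return x + mask * self.heavy(x)                          # heavy path only where the mask is on

block = CoarseSpatialGate(32)
out = block(torch.randn(2, 32, 64, 64))
```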
4. Latency Awareness in Real-Time and System Hosting Scenarios
Beyond inference-only settings, latency-aware neural networks are integral to systems with strict end-to-end latency requirements, including collaborative and distributed perception, scientific instruments, and automated control.
In collaborative perception for autonomous driving, dynamic feature-level synchronization via learnable modules (e.g., SyncNet (Lei et al., 2022)) compensates for asynchronous feature arrival caused by network latencies. Time-modulated recurrent architectures perform sequence alignment and attention estimation, yielding substantial robustness gains under severe communication delays.
For high-throughput, low-latency scientific applications (e.g., LHC sensor triggering at 25 ns intervals), the entire network and memory must fit on-chip (Weng et al., 2024). Spatially pipelined or LUT-based custom architectures, combined with logic-level codesign and on-chip memory utilization, are mandatory. Design is fundamentally dictated by bandwidth and compute constraints derived from the per-sample latency budget.
Latency-aware test-time scaling in LLM inference involves concurrent branch-wise and sequence-wise parallelism under fixed wall-clock constraints (Wang et al., 2025). Hybrid strategies are required to maximize throughput and accuracy within the latency cap, leveraging memory-bound hardware utilization.
5. Design Guidelines, Best Practices, and Practical Implications
Empirical and systematic studies across hardware, methodologies, and application settings have yielded a collection of robust design guidelines for latency-aware neural network development:
- Always profile or measure latency on the target device and runtime, as FLOPs or theoretical proxies routinely fail to capture hardware bottlenecks, quantization effects, kernel fusion, and scheduling (a minimal measurement harness is sketched after this list).
- Incorporate accurate latency predictors as differentiable loss terms or constraints in NAS, pruning, or compression. Modern GNN-based or regression-based surrogates, enhanced by hardware-aware embeddings and few-shot adaptation, yield low-error guidance for optimization.
- For mobile/embedded/pruned nets, align block or pruning granularity with the hardware’s processing units (SIMD width, memory tiles) and allow for operation fusion to minimize memory/launch overhead.
- In dynamic networks, design mask-generation and gating at a hardware-amenable granularity—coarse spatial/channel gating is essential to achieve practical speedups on modern GPUs.
- Optimize Pareto frontiers between accuracy and latency explicitly, employing multi-objective Bayesian optimization or exhaustive (but efficient) search in ILP/DP frameworks for final deployment selection.
- In collaborative or distributed systems, integrate learnable synchronization/compensation layers to maintain accuracy in the presence of nonzero and variable network-induced latencies.
- For scientific and mission-critical applications with hard real-time requirements, ensure all parameters, activations, and logic fit within the on-chip memory, and prototype via HLS/RTL with overlapping compute/memory pipelines.
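A minimal measurement harness for the first guideline might look as follows (PyTorch, with warm-up iterations and percentile reporting; the iteration counts are arbitrary):

```python
import time
import numpy as np
import torch

@torch.no_grad()
def measure_latency_ms(model, example_input, warmup=20, iters=200):
    model.eval()
    for _ in range(warmup):                       # warm-up: caches, kernel selection, clock ramp-up
        model(example_input)
    if example_input.is_cuda:
        torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        model(example_input)
        if example_input.is_cuda:
            torch.cuda.synchronize()              # wait for async GPU work before stopping the clock
        times.append((time.perf_counter() - t0) * 1e3)
    # Report percentiles rather than a single mean; tail latency often matters most.
    return {"p50": float(np.percentile(times, 50)), "p99": float(np.percentile(times, 99))}

net = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU())
print(measure_latency_ms(net, torch.randn(1, 3, 224, 224)))
```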
Emerging practice emphasizes the closing of the algorithm–system co-design loop: selection or training of neural architectures is performed with explicit and fine-grained knowledge of device- and workload-specific latency profiles, guaranteeing that the resulting models not only meet desired accuracy but are also deployable under fixed latency budgets on realistic hardware (Xu et al., 2020, Srinivas et al., 2019, Akhauri et al., 2024, Zhao et al., 2021, Han et al., 2023).