EdgeFlow: Fast Cold Starts for LLMs on Mobile Devices

Published 10 Apr 2026 in cs.OS and cs.DC | (2604.09083v1)

Abstract: Deploying LLMs on mobile devices is an emerging trend to enable data privacy and offline accessibility of LLM applications. Modern mobile neural processing units (NPUs) make such deployment increasingly feasible. However, existing mobile LLM inference frameworks suffer from high start-up latency due to their inevitable cold starts, i.e., launching LLM inferences when the model is not hosted in device memory. In this paper, we identify the key bottleneck of mobile LLM cold starts as the waste of flash bandwidth on unimportant model parameters. We design EdgeFlow, a mobile LLM inference framework that mitigates the cold start issue by adaptively adjusting the precisions of LLM parameters. Specifically, EdgeFlow leverages 1) an NPU-aware adaptive quantization algorithm that assigns different precisions to weights in a finer granularity according to their importance and NPU constraints, 2) an SIMD-friendly packing format that accelerates the transformation of various-precision weights into fixed-sized NPU-native data types, and 3) a synergistic granular pipeline that coordinates CPU and NPU computation in a fine-grained and dynamic manner. Experimental results show that EdgeFlow reduces cold-start latency by up to 4.07x compared with three state-of-the-art mobile LLM inference frameworks, i.e., llama.cpp, MNN, and LLM.npu, under comparable model accuracy.

Abstract PDF Upgrade to Chat

Authors (4)

Summary

The paper introduces EdgeFlow, a co-designed system that applies NPU-aware adaptive quantization, SIMD-friendly packing, and fine-grained pipeline scheduling to reduce mobile LLM cold start latency.
It achieves up to 4x faster first-token times across models (e.g., Llama3 8B, Mistral 7B) while maintaining near-lossless accuracy compared to INT8 baselines.
Empirical evaluations on devices like the Xiaomi 15 Pro demonstrate significant latency improvements and energy savings, making mobile LLM deployments more feasible.

EdgeFlow: Fast Cold Starts for LLMs on Mobile Devices

Motivation and Systemic Bottlenecks of Mobile LLM Cold Starts

EdgeFlow addresses a critical challenge in on-device LLM inference for mobile platforms: cold-start latency, defined as the time to first token (TTFT) when the LLM weights are not resident in device memory. Empirical profiling reveals that the cold start process on state-of-the-art frameworks such as LLM.npu is dominated by two components: weight loading from flash storage and computation graph preparation. Even with optimized techniques like materialization (offline graph compilation) and pipeline overlapping, TTFTs remain well beyond acceptable human-interaction thresholds (e.g., over 9s for Llama3 8B, considerably above the 7s user patience limit) (Figure 1).

Figure 1: Breakdown of the cold-start latencies of LLM.npu and two straightforward optimizations, materialization and overlapping.

Efforts to reduce latency through aggressive preloading are constrained by mobile device memory budgets, where satisfying TTFT SLOs (~7s) could require reserving up to 4GB, which is prohibitive for many devices (Figure 2). Thus, a reduction in transferred weights—not merely I/O scheduling—is essential.

Figure 2: TTFT v.s. pre-loaded data size for LLM.npu, showing user-acceptable latency region and EdgeFlow’s measurements.

Technical Innovations: System-Quantization Co-Design

EdgeFlow introduces three synergistic techniques, explicitly co-designed with the constraints of NPUs and mobile storage systems:

NPU-aware adaptive quantization: Unlike uniform quantization (INT8 for all weights), EdgeFlow applies per-output-channel precision assignment, subject to strict NPU requirements (symmetric, uniform per-channel quantization). The algorithm selects bit-widths for each channel based on an efficient, local relative error metric grounded in cosine distance approximations, enabling greedy, optimal allocation to fit a global bandwidth budget.
SIMD-friendly packing and unpacking: Mixed-precision tensors present considerable unpacking overheads, especially with non-byte-aligned weights (e.g., 3- or 5-bit). EdgeFlow proposes a weight packing format that decomposes each weight into “weightlets” of 1, 2, or 4 bits and stores them in an interleaved SIMD-parallel manner (Figure 3). Unpacking utilizes SIMD masking, shifting, and merging, achieving an average of 0.48 SIMD instructions per unpacked weight (Figure 4).
Figure 3: SIMD-friendly packing format. The index in each weightlet corresponds to the weight’s channel.

Figure 4: SIMD-based unpacking algorithm for heterogeneous bit-precision weights.
Synergistic granular pipeline: EdgeFlow refines operator placement and scheduling between the NPU and CPU. All INT8 matrix multiplications are dynamically dispatched to the NPU, while element-wise, low-arithmetic-intensity operators are placed on the CPU. The pipeline utilizes position-guided task prioritization and CPU-side task stealing to minimize idle (bubble) time and correct load imbalance, which is prevalent in static, coarse-grained schedules (Figure 5).
Figure 5: The synergistic granular pipeline with fine-grained operator placement and dynamic operator scheduling; each block is an operator, chunked for parallel scheduling.

Empirical Evaluation and Ablation

EdgeFlow is implemented with full stack visibility (custom Python quantization utilities, C++ runtime, direct QNN API integration for dynamic weight injection), running on Xiaomi 15 Pro with Hexagon NPU. Comprehensive evaluation covers Llama3 8B, Mistral 7B, Phi3 3.8B, and Qwen1.5 1.8B, spanning a wide prompt-length distribution (LAMBADA, WinoGrande, OBQA, MMLU, HellaSwag).

Significant speedups are demonstrated for cold TTFT:

At 7-bit average quantization, EdgeFlow achieves $3.92\times$ , $2.28\times$ , $1.47\times$ TTFT reduction over llama.cpp, MNN, and LLM.npu, respectively.
At 4 bits, TTFT improvements reach $4.24\times$ , $2.53\times$ , $1.63\times$ .
Even at long prompts, the pipeline’s operator scheduling yields $1.37\times$ – $1.41\times$ speedup beyond the best NPU-optimized baseline (LLM.npu).

Crucially, these latency gains are attained while maintaining equivalent or improved accuracy compared to INT8 baselines (EdgeFlow’s 5-bit configuration shows a negligible —0.03%—accuracy drop to LLM.npu). The adaptive quantization outperforms both uniform INT4 and competitor mixed-precision schemes (AWQ, CMPQ) in top-1 accuracy and perplexity, especially at low bit-widths (Figure 6, Figure 7).

Figure 8: Cold start latency (TTFT) and accuracy of different methods on various models and datasets.

Figure 6: Accuracy of quantization schemes across precisions on Llama3 8B; dotted line: FP16 accuracy.

Figure 7: Accuracy of quantization schemes across precisions on Phi3 3.8B; dotted line: FP16 accuracy.

EdgeFlow also demonstrates state-efficient unpacking/packing, balancing I/O and compute for minimum total pipeline blocked time (Figure 9), and a detailed breakdown shows energy savings of 12.9% against the closest NPU-based baseline with only a minor increase in mean power draw (Figure 10).

Figure 9: Performance comparison of different storage formats (EdgeFlow vs. K-Quant and INT4/INT8 mixed).

Figure 11: Breakdown of end-to-end completion latency.

Figure 10: Comparison of resource consumption during the cold start phase of Mistral 7B with 512 tokens.

Implications and Future Directions

EdgeFlow demonstrates that co-designing quantization algorithms and system-level scheduling/packing, cognizant of NPU hardware and mobile storage constraints, can dramatically reduce TTFT for LLMs on mobile platforms, with no substantial loss in model accuracy or decoding throughput.

Practical implications include:

Broader applicability: The quantization and system organization methods are immediately portable to other NPUs (including MediaTek NeuroPilot and Apple Neural Engine), supporting deployment diversity in mobile ecosystems.
Orthogonality to other approaches: EdgeFlow’s pipeline and quantization designs are additive to emerging storage-loading and compute optimization techniques (e.g., neuron rematerialization [mobicom25elms], flash-aware bundling [acl24llmflash], speculative decoding [tmc25llmcad], and dynamic input pruning [mlsys25dip]).
Model scaling: By minimizing memory footprint and I/O, EdgeFlow makes larger LLMs and multi-modal models (e.g., BlueLM-V-3B [cvpr25bluelm], MiniCPM-V [arxiv24minicpm-v]) feasible on commodity mobile devices.

Theoretically, the results validate the hypothesis that weight importance is a local property suitable for efficient mixed-precision assignment, even under severe hardware constraints, and that fine-grained pipeline scheduling is required to leverage NPU acceleration without creating new bottlenecks.

Future work should explore adaptive quantization in conjunction with run-time model adaption (dynamic bitwidth adjustment under real-time energy or latency budgets), tight integration with OS-level data prefetch and memory management, and extending SIMD-friendly packing schemes for more diverse hardware-software stacks.

Conclusion

EdgeFlow establishes a robust, empirical foundation for latency-optimized on-device LLM inference, substantiating that system-level and quantization algorithmic co-design can achieve up to $4.07\times$ faster cold starts with comparable or improved accuracy on modern mobile SoCs. The techniques introduced help shift on-device LLM applications from proof-of-concept toward deployability at interactive latency, broadening the frontier for privacy-preserving, always-available AI on ubiquitous personal devices (2604.09083).