
LLM Inference at the Edge: Mobile, NPU, and GPU Performance Efficiency Trade-offs Under Sustained Load

Published 24 Mar 2026 in cs.DC and cs.LG (arXiv:2603.23640v1)

Abstract: Deploying LLMs on-device for always-on personal agents demands sustained inference from hardware tightly constrained in power, thermal envelope, and memory. We benchmark Qwen 2.5 1.5B (4-bit quantised) across four platforms: a Raspberry Pi 5 with Hailo-10H NPU, a Samsung Galaxy S24 Ultra, an iPhone 16 Pro, and a laptop NVIDIA RTX 4050 GPU. Using a fixed 258-token prompt over 20 warm-condition iterations per device, we measure throughput, latency, power, and thermal behaviour. For mobile platforms, thermal management supersedes peak compute as the primary constraint: the iPhone 16 Pro loses nearly half its throughput within two iterations, and the S24 Ultra suffers a hard OS-enforced GPU frequency floor that terminates inference entirely. On dedicated hardware, distinct constraints dominate: the RTX 4050 is bounded by its battery power ceiling, while the Hailo-10H is limited by on-module memory bandwidth. The RTX 4050 sustains 131.7 tok/s at 34.1 W; the Hailo-10H sustains 6.9 tok/s at under 2 W with near-zero variance, matching the RTX 4050 in energy proportionality at 19x lower throughput. Results should be interpreted as platform-level deployment characterisations for a single model and prompt type, reflecting hardware and software combined, rather than general claims about hardware capability alone.

Summary

  • The paper demonstrates that thermal and power constraints, rather than peak compute, primarily govern sustained LLM inference performance.
  • It evaluates four platforms using a unified 258-token prompt, measuring throughput, latency, energy, and temperature to reveal distinct hardware bottlenecks.
  • The study highlights that edge NPUs offer stable, energy-efficient inference for background tasks, while mobile devices suffer from rapid thermal throttling in always-on scenarios.

LLM Inference at the Edge: Performance-Efficiency Trade-offs and Thermal Constraints

Overview

The paper "LLM Inference at the Edge: Mobile, NPU, and GPU Performance Efficiency Trade-offs Under Sustained Load" (2603.23640) systematically benchmarks sustained autoregressive inference for a quantized 1.5B-parameter LLM (Qwen 2.5 1.5B, 4-bit) across four edge-representative platforms: a Raspberry Pi 5 equipped with a Hailo-10H NPU, a Samsung Galaxy S24 Ultra, an iPhone 16 Pro, and an NVIDIA RTX 4050 laptop GPU. The study targets always-on agent deployment scenarios in which inference must be sustained under continuous load, showing how power, thermal envelopes, and memory bandwidth act as the primary constraints, superseding peak compute specifications. The evaluation exposes distinct platform bottlenecks, quantifies throughput, power, and thermal stability, and characterizes the real-world feasibility of LLM deployment at the edge.

Methodology and Benchmarking Design

The authors select Qwen 2.5 1.5B with 4-bit quantization for its sub-1GB memory footprint and wide framework support, enabling controlled comparisons. Each platform receives a unified 258-token prompt eliciting extended, structured output to stress sustained decode throughput and thermal management. Benchmarks are executed under warm conditions (model pre-loaded, 20 back-to-back iterations per device), with metrics including sustained throughput (tok/s), latency, power consumption, and temperature. Measurement instrumentation is platform-specific: direct current sensing for the Hailo-10H, GPU-level power via nvidia-smi for the RTX 4050, a battery proxy for the iPhone, and temperature logs for all platforms. Iteration-level results are validated for anomalies, and cold-start overhead is measured but excluded from final statistics.
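The warm-condition protocol can be sketched as a simple harness. Here `run_inference` is a hypothetical stand-in for the platform-specific inference call (the paper uses per-platform stacks, not this code); the statistics mirror the throughput and coefficient-of-variation (CV) reporting described above:

```python
import statistics
import time

def run_inference(prompt: str) -> int:
    """Hypothetical stand-in for a platform-specific inference call.
    Returns the number of tokens generated (simulated here)."""
    time.sleep(0.01)  # placeholder for real decode time
    return 500        # simulated token count per run

def benchmark(prompt: str, iterations: int = 20, warmup: int = 1):
    """Warm-condition benchmark: run (and discard) cold-start iterations,
    then record per-iteration decode throughput over back-to-back runs."""
    for _ in range(warmup):           # cold-start overhead, excluded
        run_inference(prompt)
    throughputs = []
    for _ in range(iterations):
        start = time.perf_counter()
        tokens = run_inference(prompt)
        elapsed = time.perf_counter() - start
        throughputs.append(tokens / elapsed)
    mean = statistics.mean(throughputs)
    cv = statistics.stdev(throughputs) / mean * 100  # CV in percent
    return mean, cv

mean_tps, cv_pct = benchmark("258-token prompt placeholder")
print(f"sustained throughput: {mean_tps:.1f} tok/s, CV = {cv_pct:.2f}%")
```

Reporting the CV alongside mean throughput is what lets the paper distinguish stable platforms (Hailo-10H, CV = 0.04%) from throttling ones.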

Platform-Specific Results

RTX 4050 Laptop GPU

The battery-powered RTX 4050 achieves the highest sustained throughput at 131.7 tok/s (CV = 2.2%), operating at a stable average power of 34.1 W. GPU temperature increases gradually (55°C → 70°C), but no throttling is observed. Energy per token stabilizes at 297 mJ (Figure 1).

Figure 1: RTX 4050 per-run throughput and temperature across 20 sustained inference runs; stable performance and no thermal throttling.

Raspberry Pi 5 + Hailo-10H NPU

The Hailo-10H delivers 6.9 tok/s at 1.87 W, exhibiting near-zero throughput variance (CV = 0.04%) and thermally stable operation (CPU/NPU 52.7/58.5°C). The result matches vendor reference figures, confirming memory bandwidth as the primary limiter. Energy per token is the lowest among platforms at 271 mJ, comparable to the RTX 4050 on a per-joule basis (Figure 2).

Figure 2: Hailo-10H NPU throughput and temperature over 20 iterations; deterministic and thermally stable inference.

iPhone 16 Pro

Peak throughput reaches 40.35 tok/s but degrades to a sustained 22.56 tok/s (Hot thermal state, –44% from peak) after only two iterations, with 65% of runs executed in a throttled state. Battery drain is 10% over 20 iterations; power per inference cannot be directly measured (Figure 3).

Figure 3: iPhone 16 Pro per-iteration throughput, with rapid thermal state transitions and sustained Hot-state degradation.

Samsung Galaxy S24 Ultra

The S24 Ultra sustains 9.9 tok/s across only five valid iterations before Android's thermal governor enforces a hard GPU frequency floor, terminating inference (GPU temperature 78.3°C). Prefill time is anomalously high (25.1 s), likely due to software-stack overhead. The device is unsuited to sustained agent operation (Figure 4).

Figure 4: S24 Ultra per-iteration throughput; hard OS-enforced frequency floor and benchmark termination due to severe thermal event.

Cross-Platform Analysis

Among all platforms, the RTX 4050 delivers 19× higher throughput than the Hailo-10H and 5.8× that of the iPhone 16 Pro (sustained). Both the RTX 4050 and the Hailo-10H demonstrate comparable energy proportionality (∼0.26 W per tok/s), confirming high efficiency at vastly different power envelopes. The mobile platforms (iPhone, S24 Ultra) are notably constrained by rapid thermal throttling, precluding always-on deployment even at modest query rates. The Hailo-10H is unique in delivering deterministic, thermally stable inference suitable for asynchronous or background workloads, albeit with a significant latency limitation (∼72 s for 500 tokens).
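The energy-proportionality figure follows directly from the reported numbers. A quick check (values taken from the platform sections above) reproduces the Hailo-10H's 271 mJ/token exactly; for the RTX 4050 the simple power/throughput ratio gives ≈259 mJ/token, slightly below the reported 297 mJ, a gap that presumably reflects prefill and measurement overheads beyond steady-state decode:

```python
platforms = {
    # name: (sustained throughput in tok/s, average power in W)
    "RTX 4050":  (131.7, 34.1),
    "Hailo-10H": (6.9,   1.87),
}

for name, (tps, watts) in platforms.items():
    w_per_tps = watts / tps            # energy proportionality, W per tok/s
    mj_per_tok = w_per_tps * 1000      # joules/token expressed in mJ
    print(f"{name}: {w_per_tps:.3f} W per tok/s = {mj_per_tok:.0f} mJ/token")
```

Both ratios land near 0.26–0.27 W per tok/s, which is the "comparable energy proportionality at 19× lower throughput" claim in the abstract.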

Implications for Deployment

Thermal Constraints Dominate Mobile Edge

Empirical evidence confirms that sustained LLM inference on mobile SoCs is primarily limited by thermal management rather than peak compute. Rapid throughput degradation (iPhone Hot state), abrupt throttling (S24 Ultra), and lack of thermal recovery within short cooldown intervals pose significant barriers for always-on, high-frequency agent deployments. Interactive scenarios require either significant duty-cycling or external cooling strategies to avoid reliability degradation.
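One mitigation suggested above is duty-cycling inference against a thermal budget. A minimal sketch of such a controller follows; the first-order heating/cooling model and all rate constants are illustrative assumptions, not values measured in the paper:

```python
def duty_cycle_schedule(t_ambient: float = 30.0, t_limit: float = 42.0,
                        heat_rate: float = 2.0, cool_rate: float = 0.5,
                        steps: int = 60) -> float:
    """Simulate a naive thermal-budget controller: run inference while
    below the skin-temperature limit, pause to cool otherwise.
    Rates are in degrees C per time step (illustrative values)."""
    temp, running, active_steps = t_ambient, True, 0
    for _ in range(steps):
        if running:
            temp += heat_rate        # inference heats the SoC
            active_steps += 1
            if temp >= t_limit:
                running = False      # throttle: pause inference
        else:
            temp = max(t_ambient, temp - cool_rate)  # idle cooling
            if temp <= t_ambient + 1.0:
                running = True       # resume after cooldown
    return active_steps / steps      # effective duty cycle

print(f"effective duty cycle: {duty_cycle_schedule():.0%}")
```

Even this toy model shows why slow passive cooling is the binding constraint: when the cool rate is a small fraction of the heat rate, the achievable duty cycle collapses well below 50%, consistent with the lack of thermal recovery within short cooldown intervals observed on the mobile platforms.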

Dedicated Edge NPUs as Viable Alternative

Edge NPUs like Hailo-10H offer unmatched stability and efficiency for sustained workloads, supporting indefinite operation within tight power/thermal envelopes. However, throughput remains substantially lower than other platforms, restricting their applicability to non-interactive or background tasks unless deployment optimizations (batched decoding, host-NPU integration) improve per-token latency.

GPU Edge Viability and Battery Constraints

Laptop-grade GPUs (RTX 4050) are optimal for high-throughput, thermally stable inference given sufficient power sources. Battery-powered sustained agent operation is infeasible due to drain rates (∼12% per 20 runs), limiting practical deployment to AC-powered or episodic scenarios.
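The infeasibility claim is back-of-the-envelope arithmetic on the reported drain rate, assuming linear battery drain (an assumption; real discharge curves are nonlinear):

```python
drain_per_20_runs = 0.12                  # ~12% battery per 20 runs (reported)
runs_per_charge = 20 / drain_per_20_runs  # runs on a full charge, if linear

# At one agent query per minute, an always-on assistant would exhaust
# the battery in under three hours of continuous operation.
hours_to_empty = runs_per_charge / 60     # one run per minute
print(f"~{runs_per_charge:.0f} runs per charge ≈ {hours_to_empty:.1f} h at 1 query/min")
```

Roughly 167 runs per charge leaves no headroom for the device's other workloads, which is why the authors restrict practical deployment to AC-powered or episodic scenarios.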

Framework and Measurement Limitations

Performance differences reflect combined hardware-software effects. Contrasts in quantization formats, inference stacks, and measurement methodologies constrain generalizability. Android/iOS do not expose reliable per-component power metrics, precluding cross-platform energy analysis. Only a single model and prompt were evaluated; results may not extrapolate to shorter outputs or smaller models.

Future Research Directions

The study calls for extended-duration thermal profiling (100+ iterations), standardized power instrumentation (hardware-level sensors across platforms), exploration of batching/speculative decoding on edge NPUs, quantization format unification, and coverage of diverse models and prompt types. Investigating software stack optimizations and hybrid agent architectures combining mobile SoCs with lightweight NPUs will further delineate the practical landscape for edge LLM inference.

Conclusion

Sustained autoregressive LLM inference at the edge is fundamentally governed by thermal and power constraints, not nominal compute capacity. Dedicated edge NPUs (Hailo-10H) exhibit unparalleled stability and efficiency at low throughput, while mobile flagships (iPhone, S24 Ultra) fail to sustain inference due to rapid or abrupt thermal throttling. GPU-edge devices (RTX 4050) lead in throughput but impose battery limitations. These findings underscore the necessity for platform-aware deployment strategies, including agent duty-cycling, hardware integration, and careful workload engineering for edge LLMs. Robust benchmarking standards and instrumentation are essential for future comparative studies.
