ExecuTorch: Efficient On-Device LLM Inference
- ExecuTorch is a robust inference engine that deploys PyTorch nn.Module models, including LLMs, on resource-constrained edge devices.
- It utilizes a bifurcated workflow with server-side export and on-device runtime that supports various quantization techniques and specialized backends.
- Empirical benchmarks show a 55% reduction in binary size, a 3.8× throughput increase, and up to 87.7% lower startup latency for efficient on-device operations.
ExecuTorch is Meta-AI’s successor to PyTorch Mobile, designed as an inference engine for efficiently running PyTorch nn.Module models—including LLMs—on resource-constrained edge devices such as smartphones, embedded systems, and NPUs, without requiring Python runtime or hardware-specific code modifications. Its extensible back-end architecture supports deployment on CPU (e.g., via XNNPACK), GPU, NPU, or DSP, and accommodates quantized and group-wise quantized models. ExecuTorch has also been repurposed for on-device fine-tuning of LLMs using gradient-free (zeroth-order) optimization methods, thus enabling private, fully local adaptation of LLMs in edge environments (A et al., 11 Jul 2025, Gao et al., 2024).
1. Architecture and Inference Workflow
ExecuTorch operates in a bifurcated workflow comprising an offline server-side export pipeline and an online edge-side runtime. On the server, a PyTorch model is authored, optionally augmented with parameter-efficient submodules (such as LoRA), traced to capture the forward pass, and compiled into a flatbuffer intermediate representation (ExportIR). This flatbuffer bundles static weights, model topology, and operator graphs. On-device, the ExecuTorch runtime parses the flatbuffer and dispatches its operators—including support for quantized and fused kernels—to the active hardware backend (CPU, NPU, etc.), executing the model in pure inference mode (Gao et al., 2024).
Notably, ExecuTorch does not expose built-in gradient or backpropagation capabilities; all weights are treated as immutable during runtime, with only forward operators supported. This pure-inference model is leveraged and extended by embedding specialized modules that simulate parameter updates through zeroth-order optimization, thereby enabling fine-tuning using only forward passes (Gao et al., 2024).
2. Quantization and Model Compression
ExecuTorch provides intrinsic support for quantized weights and activations, with extensibility to INT4, INT8, and group-wise quantization schemes. A typical pipeline—for example, as demonstrated in the EmoSApp deployment—combines parameter-efficient fine-tuning with quantization-aware training (QAT):
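- Weights: 4-bit group-wise quantization (group size 32) is applied to all transformer blocks. For each group $g$, the min/max $w_{\min}^{(g)}$, $w_{\max}^{(g)}$ are computed, scale is set (in the standard affine form) as $s_g = (w_{\max}^{(g)} - w_{\min}^{(g)}) / (Q_{\max} - Q_{\min})$, and zero-point $z_g = Q_{\min} - \mathrm{round}(w_{\min}^{(g)} / s_g)$, where $[Q_{\min}, Q_{\max}]$ is the 4-bit integer range (16 levels). Elements are quantized by $w_q = \mathrm{clamp}(\mathrm{round}(w / s_g) + z_g,\ Q_{\min},\ Q_{\max})$ and dequantized as $\hat{w} = s_g\,(w_q - z_g)$.
- Activations: Dynamic 8-bit quantization (per tensor) uses min/max observed at runtime and computes $s = (x_{\max} - x_{\min}) / (Q_{\max} - Q_{\min})$ over the 8-bit integer range, zero-point $z = Q_{\min} - \mathrm{round}(x_{\min} / s)$, quantized as $x_q = \mathrm{clamp}(\mathrm{round}(x / s) + z,\ Q_{\min},\ Q_{\max})$, and dequantized as $\hat{x} = s\,(x_q - z)$.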
- Special layers: Classification and embedding layers are quantized to 8-bit weights (per channel) and 8-bit dynamic activations (A et al., 11 Jul 2025).
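The group-wise scheme described in the list above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the EmoSApp or ExecuTorch implementation: the helper names (`quantize_group`, `quantize_groupwise`, and so on) are hypothetical, and an unsigned integer range of 0 to 15 is assumed for the 4-bit case.

```python
from typing import List, Tuple

Group = Tuple[List[int], float, int]  # (integer codes, scale, zero-point)

def quantize_group(values: List[float], bits: int = 4) -> Group:
    """Affine-quantize one group of values to `bits`-bit integer codes."""
    q_min, q_max = 0, (1 << bits) - 1            # assumed unsigned range, 0..15 for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (q_max - q_min) or 1.0   # fall back to 1.0 for a constant group
    zero_point = q_min - round(lo / scale)
    codes = [max(q_min, min(q_max, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point

def dequantize_group(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Map integer codes back to approximate real values."""
    return [scale * (c - zero_point) for c in codes]

def quantize_groupwise(weights: List[float], group_size: int = 32, bits: int = 4) -> List[Group]:
    """Split a flat weight list into groups and quantize each with its own scale."""
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]

def dequantize_groupwise(groups: List[Group]) -> List[float]:
    restored: List[float] = []
    for codes, scale, zero_point in groups:
        restored.extend(dequantize_group(codes, scale, zero_point))
    return restored

# Deterministic toy weights in [-0.5, 0.5]; two groups of 32.
weights = [((i * 37) % 101 - 50) / 100.0 for i in range(64)]
packed = quantize_groupwise(weights, group_size=32, bits=4)
restored = dequantize_groupwise(packed)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups={len(packed)} max_abs_error={max_error:.4f}")
```

Calling `quantize_group(activations, bits=8)` on a whole tensor's values gives the per-tensor dynamic 8-bit case under the same assumptions. Keeping one scale per 32 weights bounds the rounding error by the local range of each group rather than the range of the whole tensor.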
This pipeline enables model compression without prohibitive degradation in accuracy, making deployment feasible on memory-constrained Android devices. For example, a LLaMA-3.2-1B-Instruct model can be quantized to 1.03 GB (from 2.30 GB, a 55% reduction) while sustaining ∼13.5 tokens/sec generation and a 5.69 s time-to-first-token on a Dimensity 7025 smartphone with 8 GB RAM (A et al., 11 Jul 2025).
3. Parameter-Efficient Fine-Tuning and Zeroth-Order Optimization
ExecuTorch is compatible with parameter-efficient fine-tuning methods such as LoRA and LoRA-FA:
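- LoRA (Low-Rank Adaptation): For a weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA introduces a low-rank update $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, $r \ll \min(d, k)$. The new weight is $W = W_0 + BA$.
- LoRA-FA: Only the $B$ parameter is updated; $A$ is frozen after random initialization.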
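To make the two variants above concrete, the following toy sketch builds the merged weight $W_0 + BA$ with nested lists and counts trainable parameters. The dimensions, initialization scales, and helper names are illustrative assumptions, not values from the cited papers.

```python
import random

def matmul(a, b):
    """Multiply two row-major matrices given as nested lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

d, k, r = 8, 8, 2                      # toy sizes; real layers are far larger
rng = random.Random(0)

W0 = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]    # frozen pretrained weight (d x k)
A = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(r)]   # r x k, random init (frozen in LoRA-FA)
B = [[0.0] * r for _ in range(d)]                               # d x r, zero init (trainable)

# With B = 0 the adapted weight equals the pretrained weight.
W = add(W0, matmul(B, A))
starts_at_pretrained = (W == W0)

# Stand-in for one optimizer update that touches a single entry of B.
B[0][0] = 0.5
W_updated = add(W0, matmul(B, A))

full_params = d * k                    # trainable parameters under full fine-tuning
lora_params = d * r + r * k            # LoRA trains both B and A
lora_fa_params = d * r                 # LoRA-FA trains B only
print(full_params, lora_params, lora_fa_params)
```

Because LoRA-FA trains only $B$, any scheme that has to replicate or perturb the trainable parameters handles $d \cdot r$ values per layer rather than $d \cdot r + r \cdot k$.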
ExecuTorch natively lacks backpropagation, but embedding custom PyTorch submodules that implement zeroth-order estimators makes local fine-tuning possible. Parallelized Randomized Gradient Estimation (P-RGE) estimates gradients from multiple randomized forward passes. For LoRA/LoRA-FA, this involves batching $q$ perturbations, computing finite differences, and updating the low-rank parameters via the estimator
$$\hat{\nabla} L(\theta) = \frac{1}{q} \sum_{i=1}^{q} \frac{L(\theta + \epsilon u_i) - L(\theta - \epsilon u_i)}{2\epsilon}\, u_i,$$
where $u_i \sim \mathcal{N}(0, I)$ and $\epsilon > 0$ is the perturbation scale (Gao et al., 2024).
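A randomized gradient estimator of this kind needs only loss evaluations. The sketch below is a sequential, plain-Python illustration on a toy quadratic loss; the function names, step size, and loss are assumptions chosen for the example, and it does not reproduce the parallelized P-RGE implementation.

```python
import random
from typing import Callable, List

def rge_gradient(loss: Callable[[List[float]], float], theta: List[float],
                 q: int, eps: float, rng: random.Random) -> List[float]:
    """Average q central finite differences taken along random Gaussian directions."""
    grad = [0.0] * len(theta)
    for _ in range(q):
        u = [rng.gauss(0, 1) for _ in theta]
        loss_plus = loss([t + eps * ui for t, ui in zip(theta, u)])
        loss_minus = loss([t - eps * ui for t, ui in zip(theta, u)])
        coeff = (loss_plus - loss_minus) / (2 * eps)   # directional derivative estimate
        for j, uj in enumerate(u):
            grad[j] += coeff * uj / q
    return grad

def toy_loss(theta: List[float]) -> float:
    """Quadratic bowl with its minimum at a fixed (hypothetical) target."""
    target = [1.0, -2.0, 0.5]
    return sum((t - c) ** 2 for t, c in zip(theta, target))

rng = random.Random(0)
theta = [0.0, 0.0, 0.0]
initial_loss = toy_loss(theta)
for _ in range(300):
    g = rge_gradient(toy_loss, theta, q=4, eps=1e-3, rng=rng)
    theta = [t - 0.05 * gj for t, gj in zip(theta, g)]   # plain SGD step on the estimate
final_loss = toy_loss(theta)
print(f"loss: {initial_loss:.3f} -> {final_loss:.2e}")
```

Each update costs $2q$ forward evaluations and no backward pass, which is why the approach fits an inference-only runtime.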
Dual-forwarding modules encapsulate both inner- and outer-loop parallelization (by duplicating small LoRA parameter sets and sharing heavy backbone parameters), implemented as custom graph operators registered in the ExecuTorch representation. This allows practical on-device fine-tuning of billion-parameter LLMs directly in the inference engine.
4. Deployment and Performance Benchmarks
ExecuTorch enables the deployment of quantized and tuned LLMs on a wide spectrum of Android devices. Empirical results from EmoSApp with LLaMA-3.2-1B-Instruct provide a detailed resource-performance breakdown:
| Configuration | Binary size (GB) | Throughput (tokens/s) | TTFT (s) | Peak RAM (GB) on 8 GB / 6 GB / 4 GB device |
|---|---|---|---|---|
| Full FT (BF16) | 2.30 | 3.59 | 46.2 | 6.05 / ✗ / ✗ |
| QAT-LoRA (INT4/8) | 1.03 | 13.5 | 5.69 | 4.79 / 3.78 / ✗ |
TTFT: Time-to-First-Token; ✗: model cannot be loaded at that RAM size.
The quantized QAT-LoRA configuration reduces binary size by 55%, increases throughput by 3.8×, and reduces time-to-first-token by 87.7% relative to the full fine-tuned baseline. On devices with ≤4 GB RAM, even compact quantized models may be unloadable; 6 GB is the practical lower bound for this model size (A et al., 11 Jul 2025).
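The relative figures can be rechecked from the two table rows. The short script below uses only the reported numbers; the dictionary keys are arbitrary names chosen for the example.

```python
# Values copied from the benchmark table (A et al., 11 Jul 2025).
full_ft = {"size_gb": 2.30, "tokens_per_s": 3.59, "ttft_s": 46.2}
qat_lora = {"size_gb": 1.03, "tokens_per_s": 13.5, "ttft_s": 5.69}

size_reduction_pct = 100 * (1 - qat_lora["size_gb"] / full_ft["size_gb"])
throughput_gain = qat_lora["tokens_per_s"] / full_ft["tokens_per_s"]
ttft_reduction_pct = 100 * (1 - qat_lora["ttft_s"] / full_ft["ttft_s"])

print(f"binary size reduction: {size_reduction_pct:.1f}%")   # 55.2%
print(f"throughput gain:       {throughput_gain:.2f}x")      # 3.76x
print(f"TTFT reduction:        {ttft_reduction_pct:.1f}%")   # 87.7%
```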
In reasoning benchmarks, QAT-LoRA yields an average of 46.87% (zero-shot), closely matching the 48.75% of full fine-tuning and improving over the pretrained model's 45.80%. LoRA+PTQ delivered lower accuracy (43.95%) (A et al., 11 Jul 2025).
5. On-Device Inference and Application Integration
ExecuTorch models are distributed as single .pte (flatbuffer) files and integrated using the provided Python or Java/Android API. For example, in the EmoSApp conversational agent, the pipeline involves:
- Loading the model with `executorch.load_model`, selecting the XNNPACK backend.
- Preparing system prompts and chat history.
- Encoding input tokens and streaming outputs as tokens are generated in autoregressive fashion.
- Integration with Android UI is achieved via simple function calls to trigger on-device inference on user interaction.
The canonical inference code follows the same sequence: load the model, build the prompt, encode it, and stream generated tokens (A et al., 11 Jul 2025).
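The original listing is not reproduced here. As a rough illustration of the prompt-building and streaming steps only, the sketch below runs a greedy autoregressive loop against a stand-in model. Everything in it is hypothetical: `stub_forward` replaces the loaded ExecuTorch program, `toy_encode` replaces a real tokenizer, and the prompt template and `EOS_ID` are invented for the example. No ExecuTorch API is called.

```python
from typing import Callable, Iterator, List, Tuple

EOS_ID = 0   # hypothetical end-of-sequence token id
VOCAB = 8    # hypothetical vocabulary size of the stand-in model

def build_prompt(system_prompt: str, history: List[Tuple[str, str]], user_message: str) -> str:
    """Flatten the system prompt, prior turns, and new message into one string."""
    turns = [f"system: {system_prompt}"]
    turns += [f"{role}: {text}" for role, text in history]
    turns += [f"user: {user_message}", "assistant:"]
    return "\n".join(turns)

def toy_encode(text: str) -> List[int]:
    """Stand-in tokenizer: one id in 1..4 per whitespace-separated word."""
    return [len(word) % 4 + 1 for word in text.split()]

def stub_forward(ids: List[int]) -> List[float]:
    """Stand-in for the on-device model: favours (last id + 1) mod VOCAB."""
    target = (ids[-1] + 1) % VOCAB
    return [1.0 if i == target else 0.0 for i in range(VOCAB)]

def stream_tokens(forward: Callable[[List[int]], List[float]],
                  prompt_ids: List[int], max_new_tokens: int) -> Iterator[int]:
    """Greedy decoding that yields each token id as soon as it is chosen."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = forward(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == EOS_ID:
            break
        ids.append(next_id)
        yield next_id

prompt = build_prompt("You are a supportive assistant.", [("user", "hi"), ("assistant", "hello")], "how are you?")
generated = list(stream_tokens(stub_forward, toy_encode(prompt), max_new_tokens=16))
print(generated)
```

Writing the loop as a generator lets a UI layer append each token to the chat view as it arrives instead of waiting for the full reply.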
6. Extensions: On-Device Fine-Tuning via Zeroth-Order Methods
Recent developments have extended ExecuTorch beyond inference, using parallelized zeroth-order optimization for on-device LLM fine-tuning (Gao et al., 2024). The P-RGE technique, integrated with LoRA-FA, enables training via forward-only passes within the inference graph:
- Each update step consists of parallel evaluation of $q$ perturbed parameter sets (outer loop), each efficiently processed in a batched dual-forwarded operator (inner loop).
- Only LoRA parameters are replicated per batch, minimizing memory.
- A custom kernel (e.g., ATen `randn_like`) is registered to support stochastic perturbations.
- Empirical results show end-to-end speedups of ∼1.9–4.3× over naïve MeZO, depending on the configuration, with near parity (or even improvement) on GLUE/SuperGLUE benchmarks (Gao et al., 2024).
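The structure of one such update step can be sketched on a toy model with a scalar output. In this hedged illustration the frozen backbone and the frozen LoRA-FA matrix $A$ are evaluated once, while $2q$ small copies of the trainable vector are scored in a single batched call. The model, data, function names, and hyperparameters are invented for the example, and nothing here is registered as an ExecuTorch operator.

```python
import random
from typing import List

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def batched_forward(w0, A, adapters, x) -> List[float]:
    """Score every adapter copy while computing the shared parts only once."""
    backbone_out = dot(w0, x)                 # frozen backbone, shared by all copies
    Ax = [dot(row, x) for row in A]           # frozen A (LoRA-FA), also shared
    return [backbone_out + dot(b, Ax) for b in adapters]

def prge_step(w0, A, b, x, y, q, eps, lr, rng) -> List[float]:
    """One forward-only update of b from q paired (+eps, -eps) perturbations."""
    directions = [[rng.gauss(0, 1) for _ in b] for _ in range(q)]
    adapters = []
    for u in directions:                      # replicate only the small adapter
        adapters.append([bj + eps * uj for bj, uj in zip(b, u)])
        adapters.append([bj - eps * uj for bj, uj in zip(b, u)])
    losses = [(out - y) ** 2 for out in batched_forward(w0, A, adapters, x)]
    grad = [0.0] * len(b)
    for i, u in enumerate(directions):
        coeff = (losses[2 * i] - losses[2 * i + 1]) / (2 * eps)
        for j, uj in enumerate(u):
            grad[j] += coeff * uj / q
    return [bj - lr * gj for bj, gj in zip(b, grad)]

w0 = [0.5, -0.3, 0.8, 0.1]                               # frozen backbone weights
A = [[0.1, 0.0, -0.2, 0.3], [0.0, 0.2, 0.1, -0.1]]       # frozen rank-2 projection
b = [0.0, 0.0]                                           # trainable adapter vector
x, y = [1.0, 2.0, -1.0, 0.5], 0.5                        # one toy training example

rng = random.Random(0)
initial_loss = (batched_forward(w0, A, [b], x)[0] - y) ** 2
for _ in range(200):
    b = prge_step(w0, A, b, x, y, q=4, eps=1e-3, lr=0.5, rng=rng)
final_loss = (batched_forward(w0, A, [b], x)[0] - y) ** 2
print(f"loss: {initial_loss:.4f} -> {final_loss:.2e}")
```

The memory cost of parallelism is $2q$ copies of the adapter vector, not $2q$ copies of the backbone, which matches the replication strategy described in the list above.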
Resource comparison indicates that full-gradient fine-tuning of Llama2-7B requires ~30 GB for optimizer state and activations, whereas P-RGE+LoRA can operate within 1–2 GB (Gao et al., 2024). Empirical per-step times (e.g., 15.7 s per step for one reported configuration at sequence length 128 on a OnePlus 12 NPU in FP16) establish feasibility for interactive local tuning in edge scenarios.
7. Current Limitations and Future Directions
While ExecuTorch substantially lowers the barrier for on-device LLM deployment, certain limitations persist:
- Variance in the zeroth-order gradient estimator falls only as $O(1/q)$ with the number of perturbations $q$, so high-precision fine-tuning requires larger $q$, which increases computational burden.
- To enable more sophisticated perturbation schemes or compressed formats, the registration of additional custom kernels may be necessary.
- Model execution for devices with <6 GB RAM remains infeasible for LLMs at the billion-parameter scale using current quantization strategies.
- Expansion of backend support (Vulkan, ROCm) and integration of further quantization formats (INT8, NF4) are cited as future priorities (Gao et al., 2024).
In summary, ExecuTorch is a robust and extensible inference engine for quantized LLM deployment and, via embedded zeroth-order optimization, supports efficient on-device fine-tuning—all within the operational and memory boundaries of contemporary mobile and embedded hardware (A et al., 11 Jul 2025, Gao et al., 2024).