MobileLLM-Pro: Efficient On-Device LLM Suite
- MobileLLM-Pro is a one-billion-parameter language model suite optimized for efficient, low-latency on-device inference and fine-tuning on mobile devices.
- It leverages a transformer architecture with rotary positional embeddings and grouped-query attention to support context windows up to 128,000 tokens.
- Innovative training phases, specialist model merging, and quantization-aware strategies deliver state-of-the-art performance with minimal accuracy loss.
MobileLLM-Pro is a one-billion-parameter LLM suite optimized for efficient, low-latency inference and fine-tuning directly on resource-constrained mobile and wearable devices. It supports context windows up to 128,000 tokens, shows only minor accuracy loss under INT4 quantization, and integrates advanced training, model merging, data mixing, quantization, and privacy-aware deployment methods designed for practical on-device AI applications.
1. Model Architecture and Attention Mechanisms
MobileLLM-Pro features a transformer-based backbone comprising 1.08 billion parameters distributed across 30 transformer blocks, with a hidden dimension of 1,280 and a feed-forward expansion factor of 4.8× (inner size 6,144). Input and output embeddings are tied over a 202,048-token vocabulary, saving approximately 260 million parameters. Attention uses 20 query heads (each of size 64) with grouped-query attention over 4 key-value heads, which reduces memory and compute overhead. Context support is extended to 128,000 tokens by interleaving local (512-token sliding-window) and global attention layers in a 3:1 ratio, with the first and last layers kept global.
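The configuration below is a minimal illustrative sketch of these hyperparameters and of one plausible way to realize the 3:1 local/global interleaving; the class and function names are assumptions, not the released implementation.

```python
# Illustrative configuration sketch; names are assumptions, not the released API.
from dataclasses import dataclass

@dataclass
class MobileLLMProConfig:
    n_layers: int = 30          # transformer blocks
    d_model: int = 1280         # hidden dimension
    d_ff: int = 6144            # 4.8x feed-forward expansion
    n_heads: int = 20           # query heads of size 64
    n_kv_heads: int = 4         # grouped-query attention
    vocab_size: int = 202_048   # tied input/output embedding
    local_window: int = 512     # sliding-window span for local layers
    max_context: int = 128_000  # maximum supported context length

def attention_pattern(cfg: MobileLLMProConfig) -> list[str]:
    """Interleave local and global layers at roughly 3:1,
    forcing the first and last layers to be global."""
    pattern = []
    for i in range(cfg.n_layers):
        if i == 0 or i == cfg.n_layers - 1 or (i + 1) % 4 == 0:
            pattern.append("global")
        else:
            pattern.append("local")
    return pattern
```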
The attention span extension exploits the rotary positional embedding (RoPE) mechanism. RoPE rotates each pair of embedding dimensions $(x_{2i}, x_{2i+1})$ at position $m$ by the angle $m\theta_i$, with $\theta_i = \theta_{\text{base}}^{-2i/d}$:

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}$$

The embedding design and layer-wise arrangement yield highly scalable inference with bounded memory consumption, suitable for mobile hardware constraints.
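A plain-Python sketch of applying this rotation to a single query/key vector is shown below; the base frequency of 10,000 is the conventional RoPE default and an assumption here, since the source does not state the value used.

```python
import math

def apply_rope(x: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs (x[2i], x[2i+1]) by position * theta_i,
    where theta_i = base**(-2i/d)."""
    d = len(x)
    out = list(x)
    for i in range(d // 2):
        angle = position * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out[2 * i] = c * x0 - s * x1
        out[2 * i + 1] = s * x0 + c * x1
    return out
```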
2. Training Strategy and Core Innovations
MobileLLM-Pro training is divided into four distinct phases:
- Phase 1 (Language Acquisition and Data Mixing): The model is trained on 1.4T tokens in batches of 2M tokens, with logit distillation from Llama 4-Scout and a dynamic learning rate schedule. Data sampling weights per domain are set by the Scalable Data Mixer, which estimates domain utility via per-sample CE loss improvements tracked by a small regressor over influence vectors.
- Phase 2 (Context Expansion): Starting from the Phase 1 checkpoint, the model undergoes 100k updates (20B tokens) with local-global attention and implicit positional distillation—the latter leveraging knowledge distillation of logits from a long-context teacher without exposing the student to actual long documents.
- Phase 3 (Specialist Model Merging): The Phase 2 model forks into specialists (1–5), each further fine-tuned on 60M domain-specific tokens and merged by weighted averaging, $\theta_{\text{merged}} = \sum_k w_k \theta_k$ with $\sum_k w_k = 1$; the weights $w_k$ are empirically determined for optimal validation-set performance.
- Phase 4 (Quantization-Aware Training and Self-Distillation): The quantization process uses INT4 weights and, depending on execution target, INT8 (CPU) or BF16 (accelerator) activations. Group-wise (CPU, group size 32) or channel-wise (accelerator) weight quantization is performed dynamically in the training loop; variants include "vanilla" (fixed min/max) and "learnable" (min/max as trainable parameters). Self-distillation combines cross-entropy and KL-divergence loss terms, $\mathcal{L} = \alpha\,\mathcal{L}_{\text{CE}} + (1-\alpha)\,D_{\text{KL}}(p_{\text{teacher}} \,\|\, p_{\text{student}})$, as sketched in the code after this list.
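The snippet below is a minimal PyTorch sketch of the "vanilla" group-wise INT4 fake-quantization variant together with the combined CE + KL self-distillation loss; the group size of 32 follows the text, while alpha, the temperature, and all function names are assumptions.

```python
# Minimal sketch of in-loop weight fake-quantization plus self-distillation.
# Group size 32 follows the text; alpha and tau are illustrative assumptions.
import torch
import torch.nn.functional as F

def fake_quant_int4(w: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Group-wise asymmetric INT4 fake quantization (quantize, then dequantize)."""
    g = w.reshape(-1, group_size)
    w_min = g.min(dim=1, keepdim=True).values
    w_max = g.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0     # 2**4 - 1 quantization levels
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(g / scale) + zero, 0, 15)
    return ((q - zero) * scale).reshape(w.shape)

def self_distill_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=1.0):
    """alpha * CE(student, labels) + (1 - alpha) * KL(teacher || student)."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),   # student log-probs
        F.log_softmax(teacher_logits / tau, dim=-1),   # teacher log-probs
        log_target=True,
        reduction="batchmean",
    ) * tau ** 2
    return alpha * ce + (1.0 - alpha) * kl
```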
3. Specialist Model Merging and Data Mixing
The specialist model merging framework orchestrates the fusion of multiple domain experts into a single model of fixed size. Each branch starts from a shared checkpoint, is exposed to a short burst (∼500 steps) of domain-specific data, and is averaged into the base via non-uniform merging weights.
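A hedged sketch of the merging step, assuming specialists share one architecture and are stored as PyTorch-style state dicts; the merging weights in the usage comment are placeholders, not the tuned values.

```python
def merge_specialists(state_dicts: list[dict], weights: list[float]) -> dict:
    """Non-uniform weighted average of specialist checkpoints into one model."""
    assert abs(sum(weights) - 1.0) < 1e-6, "merging weights should sum to 1"
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged

# Example with placeholder weights favoring the base checkpoint:
# merged = merge_specialists([base_sd, code_sd, math_sd], [0.6, 0.2, 0.2])
```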
Simulation-driven data mixing ensures effective data diversity and task performance. The Scalable Data Mixer creates a static weighting over domains based on the mean utility prediction from the regressor applied to per-domain influence vectors. This produces a representative training set composition (e.g., approximately 90% web, 5% code), optimizing generalization for mobile SE and UX tasks.
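The sketch below illustrates only the final step described above, turning predicted per-domain utilities into static sampling weights; the utility values and domain names are invented for illustration.

```python
def mixing_weights(domain_utility: dict[str, float]) -> dict[str, float]:
    """Normalize non-negative predicted utilities into sampling probabilities."""
    total = sum(max(u, 0.0) for u in domain_utility.values())
    return {d: max(u, 0.0) / total for d, u in domain_utility.items()}

# Invented utilities yielding roughly the ~90% web / ~5% code mix noted above.
weights = mixing_weights({"web": 9.0, "code": 0.5, "math": 0.3, "qa": 0.2})
```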
4. Quantization-Aware Training and Deployment
MobileLLM-Pro employs in-loop quantization-aware training compatible with both CPU and mobile AI accelerator hardware. Group-wise INT4 weight quantization for CPU inference and channel-wise quantization for accelerators are combined with dynamic asymmetric INT8/BF16 activation quantization. The asymmetric quantization of an activation tensor $x$ takes the form

$$x_q = \mathrm{clamp}\!\left(\Big\lfloor \tfrac{x}{s} \Big\rceil + z,\; 0,\; 2^{b}-1\right),$$

with scale $s = (x_{\max} - x_{\min})/(2^{b}-1)$ and zero point $z = -\lfloor x_{\min}/s \rceil$, where $b = 8$ for INT8 activations. Performance regressions under INT4 quantization remain minor: empirical average drops are 0.73% (INT4-CPU) and 1.3% (INT4-accel) across all benchmarks. On-device model footprints are 2.2 GB (BF16), 590 MB (INT4-CPU), and 720 MB (INT4-accel).
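A minimal PyTorch sketch of this dynamic per-tensor asymmetric INT8 activation quantization; the function names are assumptions, and per-token or per-channel granularity would follow the same pattern.

```python
import torch

def quantize_activations_int8(x: torch.Tensor):
    """Dynamic asymmetric quantization: ranges come from the tensor at runtime."""
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min).clamp(min=1e-8) / 255.0
    zero_point = torch.round(-x_min / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, 0, 255).to(torch.uint8)
    return q, scale, zero_point

def dequantize(q: torch.Tensor, scale, zero_point) -> torch.Tensor:
    return (q.to(torch.float32) - zero_point) * scale
```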
Inference is supported via ExecuTorch, using the XNNPACK backend for CPU and the HTP backend for Qualcomm Hexagon DSPs. Representative latencies are 8.9 s (Galaxy S25 CPU) versus 2.0 s (Galaxy S24 HTP) for a 2K-token prefill, with decoding at 32 tokens/s; the KV cache occupies roughly 40 MB at 8K context. These resource metrics position MobileLLM-Pro within the operational capabilities of contemporary flagship mobile hardware.
5. Benchmark Evaluation and Instruction Fine-Tuning
MobileLLM-Pro achieves state-of-the-art results across 11 standard pre-training benchmarks, consistently outperforming both Gemma 3-1B and Llama 3.2-1B (gains of 4–6 points on several tasks). Notable results include HellaSwag (67.11 vs. 62.30/65.69), BoolQ (76.24 vs. 63.20/62.51), PIQA (76.55 vs. 73.80/75.14), and needle-in-a-haystack (NIH) long-context retrieval (100.00 vs. –/96.80).
Instruction fine-tuning, performed over 7.64M annotated requests (three stages: diversity-first, leave-one-out reweighting, safety+DPO), yields consistent wins over baselines for coding (MBPP, HumanEval), QA, function calling, rewriting, and summarization. Competitiveness is maintained on general benchmarks such as MMLU and IFEval; ablation studies confirm minor regressions under low-bit quantization.
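The exact preference-optimization formulation is not reproduced here; for reference, the standard DPO objective used in such safety/preference stages has the form (with $\beta$ a temperature hyperparameter and $\pi_{\mathrm{ref}}$ the frozen reference policy):

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],$$

where $(y_w, y_l)$ are the preferred and rejected responses for prompt $x$.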
6. Privacy, Security, and Multi-Modal Considerations
MobileLLM-Pro incorporates privacy-preserving mechanisms throughout its pipeline. Quantization-aware training, as described in MobiLLM (Li et al., 27 Feb 2025), enables fundamentally private fine-tuning by transmitting only quantized intermediate activations (no raw data or gradients) to server-side side-networks. Secure enclave execution (e.g., via ARM TrustZone) and differential privacy (DP) noise on activations or gradients are proposed for formal privacy guarantees.
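A hedged sketch of the Gaussian mechanism applied to transmitted activations, as proposed above; the clipping norm and noise multiplier are illustrative assumptions, and calibrating them to a concrete $(\varepsilon, \delta)$ budget is out of scope here.

```python
import torch

def dp_noisy_activations(h: torch.Tensor, clip_norm: float = 1.0, sigma: float = 0.5):
    """Clip each activation row to L2 norm <= clip_norm, then add Gaussian noise
    with standard deviation sigma * clip_norm (Gaussian mechanism)."""
    norms = h.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    h_clipped = h * torch.clamp(clip_norm / norms, max=1.0)
    return h_clipped + torch.randn_like(h_clipped) * (sigma * clip_norm)
```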
For multi-modal fusion, unified encoders combine textual, visual, and sensor features into a joint representation, e.g. $z = f_{\text{fuse}}\big([\,E_{\text{text}}(x_{\text{text}});\, E_{\text{vis}}(x_{\text{vis}});\, E_{\text{sens}}(x_{\text{sens}})\,]\big)$. This supports on-device multi-modal tasks and runtime monitoring. Additional security measures encompass lightweight enclave emulation, model obfuscation, dual-key watermarks, and anomaly detection via DP-aware log filtering.
7. Strategic R&D Roadmap and Applications
A phased strategic plan for MobileLLM-Pro aligns with six research directions in mobile-LLM development (Chen et al., 9 Jul 2024):
| Phase | Milestone | Metric/Output |
|---|---|---|
| Data Collection | On-device/federated aggregator | ≥10M mobile snippets/UX logs |
| Core Model/Compression | LoRA adapters, 4-bit QAT pipeline | ≤200 MB model, ≤80 ms/token, ≤1% accuracy drop |
| Security & API | Secure enclave, obfuscator, SDK release | ≤5% IP extraction, <5 ms API overhead |
| Monitoring & UX | LLM runtime monitor, pilot UX integration | F1 ≥0.9 anomaly, user satisfaction ≥4.2/5 |
| Multi-Modal+Federated | Multi-modal LLM, DP federated fine-tuning | ≥85% multi-modal acc., ε≤1 privacy budget |
On-device applications encompass assistants, code generation, function calling, summarization, retrieval, and UX monitoring, under stringent compute/memory/energy budgets. The SDK delivers accessible APIs (e.g., LLMManager.loadModel) to app developers, abstracting hardware-specific deployment and dynamic tuning challenges.
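A purely illustrative usage sketch follows; only LLMManager.loadModel is named in the text, so the package, arguments, and generate call below are hypothetical placeholders rather than the actual SDK surface.

```python
from mobilellm_sdk import LLMManager  # hypothetical package name

# Hypothetical model identifier and options; only loadModel is named in the text.
model = LLMManager.loadModel(
    "mobilellm-pro-int4-cpu",
    backend="xnnpack",        # CPU path; an HTP backend would target Hexagon DSPs
    max_context=8192,         # bounds the on-device KV-cache footprint
)
reply = model.generate("Summarize today's unread notifications.", max_new_tokens=128)
print(reply)
```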
A plausible implication is that as mobile hardware evolves, further advancements in model sparsity, hardware-aware quantization, secure deployment, and federated learning are expected to extend MobileLLM-Pro's capabilities, enabling increasingly sophisticated private AI services on ubiquitous smart devices.