On-Device Fine-Tuning: Challenges & Advances
- On-device fine-tuning is the process of adapting pre-trained machine learning models directly on edge devices, allowing personalized, context-sensitive performance with minimized communication overhead.
- Techniques like LoRA and parameter-efficient fine-tuning reduce memory and compute demands by training only a small fraction of parameters, enabling adaptation on resource-constrained hardware.
- When combined with federated learning and differential privacy, these approaches protect sensitive data while balancing trade-offs among accuracy, memory consumption, and computational cost.
On-device fine-tuning refers to the adaptation or personalization of machine learning models—particularly deep neural networks and LLMs—directly on edge devices such as smartphones, IoT nodes, microcontrollers, and embedded hardware. This paradigm brings learning closer to user or deployment data, enabling privacy-preserving, low-latency, and context-sensitive model adaptation under stringent hardware, bandwidth, and privacy constraints. On-device fine-tuning intersects research in parameter-efficient adaptation, memory and compute-efficient training, privacy-enhanced collaboration, federated learning, integer-only optimization, and system-level compilation for diverse hardware. The following sections review the core methods, system strategies, efficiency mechanisms, privacy techniques, and deployment guidelines that define the field.
1. Technical Challenges in On-Device Fine-Tuning
On-device fine-tuning is fundamentally limited by three intertwined bottlenecks:
- Memory and Storage Constraints: Standard full-model adaptation or even vanilla parameter-efficient fine-tuning (PEFT) protocols (e.g., LoRA, adapters) require storing all intermediate activations and optimizer states during backpropagation. For a moderately sized LLM (e.g., OPT-1.3B with batch size 16 and sequence length 256), peak memory exceeds 20 GB, far beyond the 4–12 GB DRAM typical on mobile or edge hardware (Li et al., 27 Feb 2025).
- Computation and Hardware Mismatch: Classical training workflows require large matrix multiplications, gradient computation, and custom backward kernels, while mobile NPUs and DSPs are typically optimized for forward inference only and lack dedicated backward support (Li et al., 27 Feb 2025).
- Communication and Privacy: In federated or collaborative setups, exchanging large parameter or activation blobs strains uplinks and exposes sensitive local data or adaptation signatures (Xu et al., 11 Sep 2025, Wagner et al., 2024). There is also inherent heterogeneity in device capabilities and data distributions (Cho et al., 2024, Fang et al., 31 Jan 2025).
These challenges motivate a rich vein of algorithmic and systems-level innovations, as detailed below.
2. Memory- and Compute-Efficient Adaptation Methods
2.1 Parameter-Efficient Fine-Tuning (PEFT)
LoRA and Adapters: The canonical approach is to freeze the majority of the pre-trained model's weights and train only a small, usually low-rank, set of additional parameters. For a linear transformation with weight $W \in \mathbb{R}^{d \times k}$, LoRA learns an update $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$ ($r \ll \min(d, k)$), thus reducing trainable and communicable parameters from $dk$ to $r(d + k)$ per layer (Wang et al., 10 Mar 2026).
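The following is a minimal PyTorch sketch of the idea: the pretrained linear weight stays frozen and only the low-rank factors are trainable. The rank and scaling defaults are illustrative, not taken from any of the cited systems.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # pretrained weights stay frozen
        # Low-rank factors: A (r x in), B (out x r); B is zero-initialized so
        # the adapted model starts identical to the pretrained one.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * ((x @ self.A.t()) @ self.B.t())
```

Only the factors A and B (and optionally biases) are stored, updated, or exchanged, which is what keeps optimizer state and communication small.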
Selective Layer/Node Updates: Freeze And Reconfigure (FAR) (Vucetic et al., 2022) and Layer-Cyclic Selective Backpropagation (LCSB) (Park et al., 13 Feb 2026) further reduce activation and backpropagation costs by identifying and updating only a small fraction of model components either through “learner node” priming (FAR) or stochastic layer selection at each iteration (LCSB). PockEngine additionally enables sparse backpropagation via compile-time graph pruning (Zhu et al., 2023).
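As a rough illustration of per-iteration layer selection (the exact selection policies of FAR and LCSB differ), the sketch below freezes everything and enables gradients for a single layer group each step; `layer_modules` is an assumed handle to the model's blocks.

```python
import random
import torch.nn as nn

def select_trainable_layer(model: nn.Module, layer_modules: list) -> int:
    """Enable gradients for one randomly chosen layer group this iteration (sketch)."""
    for p in model.parameters():
        p.requires_grad_(False)              # freeze everything by default
    chosen = random.randrange(len(layer_modules))
    for p in layer_modules[chosen].parameters():
        p.requires_grad_(True)               # only this layer is updated this step
    return chosen
```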
2.2 Structured and Quantized Adaptation
Memory-Efficient Structured Backpropagation (MeSP): By exploiting LoRA’s low-rank structure, intermediates can be recomputed on-demand in backward passes, which drastically reduces per-layer activation storage requirements while preserving exact first-order gradients (Park et al., 13 Feb 2026). This yields up to 62% memory reduction compared to gradient checkpointing (Park et al., 13 Feb 2026).
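A simplified autograd sketch of the recompute idea (ignoring LoRA scaling and the frozen base path): only the input and the small factors are saved, and the low-rank intermediate is rebuilt during the backward pass instead of being stored, while the returned gradients remain exact.

```python
import torch

class LoRARecompute(torch.autograd.Function):
    """Low-rank matmul that recomputes its intermediate in backward (sketch)."""

    @staticmethod
    def forward(ctx, x, A, B):
        # x: (N, d_in), A: (r, d_in), B: (d_out, r)
        ctx.save_for_backward(x, A, B)       # do NOT save the (N, r) intermediate
        h = x @ A.t()
        return h @ B.t()

    @staticmethod
    def backward(ctx, grad_out):
        x, A, B = ctx.saved_tensors
        h = x @ A.t()                        # recompute the intermediate on demand
        grad_B = grad_out.t() @ h            # (d_out, r)
        grad_h = grad_out @ B                # (N, r)
        grad_A = grad_h.t() @ x              # (r, d_in)
        grad_x = grad_h @ A                  # (N, d_in)
        return grad_x, grad_A, grad_B

# usage: y = LoRARecompute.apply(x, A, B)
```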
Integer and Quantized Training: GSQ-Tuning (Zhou et al., 18 Feb 2025) replaces all floating point operations in both forward and backward passes with group-shared exponent integer (GSEI) arithmetic, enabling integer-only on-device fine-tuning with substantial reductions in power, memory, and chip area, while maintaining competitive accuracy.
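The block-floating-point flavor of this can be illustrated as follows; the group size and mantissa width are hypothetical defaults, and the exact GSEI format used by GSQ-Tuning may differ.

```python
import torch
import torch.nn.functional as F

def gse_quantize(x: torch.Tensor, group_size: int = 64, bits: int = 8):
    """Quantize a tensor to integer mantissas with one shared exponent per group (sketch)."""
    flat = x.reshape(-1)
    pad = (-flat.numel()) % group_size
    flat = F.pad(flat, (0, pad))
    groups = flat.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1
    # Shared exponent per group, chosen so the largest magnitude fits the mantissa range.
    max_abs = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    exponent = torch.ceil(torch.log2(max_abs))
    scale = 2.0 ** exponent / qmax
    mantissa = torch.round(groups / scale).clamp(-qmax, qmax).to(torch.int8)
    return mantissa, exponent, pad

def gse_dequantize(mantissa, exponent, pad, shape, bits: int = 8):
    qmax = 2 ** (bits - 1) - 1
    flat = (mantissa.float() * (2.0 ** exponent / qmax)).reshape(-1)
    return (flat[:-pad] if pad else flat).reshape(shape)
```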
2.3 Efficient Backpropagation Avoidance
Zeroth-Order Optimization (MeZO, P-RGE): Fine-tuning without storing intermediates is enabled by zeroth-order gradient estimation (e.g., MeZO (Katti et al., 14 Nov 2025), P-RGE (Gao et al., 2024)), which replaces backward passes with forward-only finite-difference computations over random perturbations. These approaches permit models up to 2x larger to fit within the same device memory, at the expense of increased wall-clock time for convergence.
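A minimal sketch of the forward-only update pattern, assuming PyTorch and a single device: reusing the RNG seed means the perturbation direction never needs to be stored, which is the memory trick these methods exploit. Hyperparameters are illustrative, not taken from the cited papers.

```python
import torch

def zo_step(model, loss_fn, batch, lr=1e-4, eps=1e-3, seed=0):
    """One forward-only zeroth-order update (MeZO-style sketch)."""
    params = [p for p in model.parameters() if p.requires_grad]
    device = params[0].device                # assumes all trainable params on one device

    def perturb(scale):
        gen = torch.Generator(device=device).manual_seed(seed)
        for p in params:
            z = torch.randn(p.shape, generator=gen, device=device, dtype=p.dtype)
            p.data.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1.0)                        # theta + eps*z
        loss_plus = loss_fn(model, batch)
        perturb(-2.0)                        # theta - eps*z
        loss_minus = loss_fn(model, batch)
        perturb(+1.0)                        # restore theta
        g = (loss_plus - loss_minus) / (2 * eps)   # directional derivative estimate
        gen = torch.Generator(device=device).manual_seed(seed)
        for p in params:                     # descend along the same direction z
            z = torch.randn(p.shape, generator=gen, device=device, dtype=p.dtype)
            p.data.add_(-lr * g * z)
    return float(loss_plus + loss_minus) / 2
```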
2.4 Lightweight Architectures for Edge
Skip2-LoRA and LoRA-Edge: Skip2-LoRA (Matsutani et al., 2024) utilizes adapter placement and forward activation caching to minimize redundant computation, achieving 90% reduction in wall-clock time for fine-tuning on low-cost ARM microcontrollers. LoRA-Edge (Kwak et al., 5 Nov 2025) applies tensor-train SVD decompositions to CNN weights, updating only output-side TT cores, resulting in two orders of magnitude reductions in trainable parameters and accelerated on-device adaptation.
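The skip-forward idea can be illustrated as below: because the layers beneath the adapters are frozen, their per-sample outputs never change, so they can be computed once and replayed in later epochs. The sample-id keyed cache is an assumption about the data pipeline, not Skip2-LoRA's exact mechanism.

```python
import torch
import torch.nn as nn

class CachedFrozenBackbone(nn.Module):
    """Cache frozen-backbone activations per sample across epochs (sketch)."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.cache = {}                      # sample id -> cached activation

    def forward(self, x, sample_ids):
        outs = []
        for xi, sid in zip(x, sample_ids):
            key = int(sid)
            if key not in self.cache:        # first epoch: compute and store
                with torch.no_grad():
                    self.cache[key] = self.backbone(xi.unsqueeze(0)).squeeze(0)
            outs.append(self.cache[key])     # later epochs: skip the forward pass
        return torch.stack(outs)
```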
3. System and Collaboration Strategies
3.1 Split and Side-Tuning Architectures
Server-Assisted Fine-Tuning (MobiLLM, PAE MobiLLM):
- MobiLLM (Li et al., 27 Feb 2025): The device performs forward passes through a frozen backbone, transmits low-bit quantized activations to a server, which executes backpropagation over a side-network of adapters. Backpropagation through the backbone is bypassed, significantly lowering device memory usage ($4.5$ GB vs $14.6$ GB on OPT-1.3B) and converging in half the time of LoRA (Li et al., 27 Feb 2025).
- PAE MobiLLM (Yang et al., 1 Jul 2025): Further reduces device and communication cost by sending only pivot-token activations and privacy-preserving label differences once per sample, followed by persistent server-side adapter training using cached activations. This approach substantially reduces device FLOPs and communication relative to side-tuning baselines, with negligible accuracy loss (a device-side sketch of the activation-offloading pattern follows this list).
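A hedged sketch of the device-side step in this pattern: the device runs only the frozen backbone forward, quantizes the hidden activations to low-bit integers, and hands them to a transport callback (`send_fn` is a hypothetical placeholder); all adapter training then happens server-side.

```python
import torch

def device_forward_and_offload(backbone, x, send_fn, bits: int = 8):
    """Forward through the frozen backbone and ship quantized activations (sketch)."""
    backbone.eval()
    with torch.no_grad():                    # no backward graph is ever built on-device
        h = backbone(x)
    qmax = 2 ** (bits - 1) - 1
    scale = h.abs().amax().clamp(min=1e-12) / qmax
    h_q = torch.round(h / scale).clamp(-qmax, qmax).to(torch.int8)
    send_fn(h_q, scale.item())               # low-bit activations + scale go upstream
```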
Split Learning (FlexP-SFL):
- Flexible Personalized Split Federated Learning (Yuan et al., 14 Aug 2025) allows each client to process a configurable share of model layers, offloading the remainder to the server. Alignments via KL-based regularizers improve global representations. This asynchronous and communication-efficient protocol circumvents straggler and memory bottlenecks characteristic of classic federated schemes.
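The client-side objective can be sketched as a task loss plus a KL alignment term toward the global model's predictions; the weighting coefficient here is illustrative rather than the value used in the cited work.

```python
import torch.nn.functional as F

def split_fl_client_loss(task_loss, client_logits, global_logits, kl_weight: float = 0.1):
    """Personalized split-FL client loss with KL-based alignment (sketch)."""
    kl = F.kl_div(
        F.log_softmax(client_logits, dim=-1),   # personalized client predictions
        F.softmax(global_logits, dim=-1),       # global/server reference predictions
        reduction="batchmean",
    )
    return task_loss + kl_weight * kl
```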
3.2 Federated and Collaborative Fine-Tuning
LoRA-based Federated Learning: In federated setups, LoRA's reduction of trainable parameter counts enables low-latency, privacy-preserving updates (a minimal aggregation sketch follows this list):
- DP-FedLoRA (Xu et al., 11 Sep 2025): Integrates LoRA with per-client differential privacy, enforcing $(\epsilon, \delta)$-DP via Gaussian-perturbed (clipped) updates, and achieves competitive performance with strong privacy guarantees.
- Personalized Collaborative Fine-Tuning (Wagner et al., 2024): LoRA updates are combined across devices using trust-weighted aggregation (based on validation loss or prediction similarity), outperforming traditional FedAvg, especially under data heterogeneity.
- HetLoRA and Federated Sketching LoRA (FSLoRA) (Cho et al., 2024, Fang et al., 31 Jan 2025): Address device heterogeneity by letting each device select LoRA adapter rank adaptively (HetLoRA) or update only a submatrix (“sketch”) of LoRA parameters according to local resources (FSLoRA), with convergence and communication guarantees.
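As a concrete illustration of trust-weighted aggregation, the sketch below derives weights from peers' losses on the aggregating client's validation set via a softmax; the temperature is an assumed knob, and the cited method may weight peers differently.

```python
import torch

def trust_weighted_merge(peer_deltas, peer_val_losses, temperature: float = 1.0):
    """Merge peers' LoRA updates weighted by local validation performance (sketch).

    peer_deltas: list of dicts mapping parameter names to LoRA delta tensors.
    peer_val_losses: each peer's loss on the aggregating client's validation set.
    """
    losses = torch.tensor(peer_val_losses, dtype=torch.float32)
    weights = torch.softmax(-losses / temperature, dim=0)    # lower loss -> more trust
    merged = {}
    for name in peer_deltas[0]:
        stacked = torch.stack([d[name] for d in peer_deltas], dim=0)
        shape = (-1,) + (1,) * (stacked.dim() - 1)
        merged[name] = (weights.view(shape) * stacked).sum(dim=0)
    return merged
```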
Label Correction for On-Device Recommenders: Local data prior drift (e.g. per-user CTR) can impair ranking after fine-tuning. Recalibrating local samples' labels to match global priors restores ranking consistency and yields measurable gains in both experimental and real-world deployments (Ding et al., 2022).
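One simple prior-correction heuristic, not necessarily the rule used in the cited work, is to shift local soft labels in log-odds space by the mismatch between the local and global priors:

```python
import math

def recalibrate_soft_labels(labels, local_prior, global_prior):
    """Shift local soft labels toward the global label prior (illustrative sketch)."""
    shift = (math.log(global_prior / (1 - global_prior))
             - math.log(local_prior / (1 - local_prior)))
    corrected = []
    for y in labels:
        y = min(max(y, 1e-6), 1 - 1e-6)          # avoid infinite log-odds
        logit = math.log(y / (1 - y)) + shift    # move by the prior mismatch
        corrected.append(1.0 / (1.0 + math.exp(-logit)))
    return corrected
```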
4. Privacy Preservation and Differential Privacy
Local and Federated Privacy Mechanisms: Differentially private training has been successfully integrated with LoRA-based PEFT in both cross-device (DP-FedLoRA (Xu et al., 11 Sep 2025)) and dynamic-rank settings (DP-DyLoRA (Xu et al., 2024)), overcoming the severe utility drop observed in full-model DP-FL. Rank randomization is managed globally to retain sensitivity calibration, and adaptive rank selection strikes a signal-to-noise balance, yielding a <2% accuracy drop even under tight privacy budgets with $1$M clients.
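The core per-client mechanism can be sketched as clipping the LoRA update to a fixed L2 norm and adding calibrated Gaussian noise before upload; the clip norm and noise multiplier below are placeholders that would in practice be derived from the target $(\epsilon, \delta)$ budget.

```python
import torch

def dp_sanitize_update(delta, clip_norm: float = 1.0, noise_multiplier: float = 1.0):
    """Clip a client's LoRA update and add Gaussian noise before upload (sketch).

    delta: dict mapping parameter names to LoRA update tensors.
    """
    flat = torch.cat([v.reshape(-1) for v in delta.values()])
    scale = min(1.0, clip_norm / (float(flat.norm()) + 1e-12))   # L2 clipping factor
    sanitized = {}
    for name, v in delta.items():
        noise = torch.randn_like(v) * noise_multiplier * clip_norm
        sanitized[name] = v * scale + noise
    return sanitized
```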
Activation and Label Masking: PAE MobiLLM (Yang et al., 1 Jul 2025) employs privacy-masked prediction differences and random nonces, ensuring that the server never receives true labels or raw activations. Only device-defined side-network computations are disclosed, with formal mechanisms supporting further DP or secure multi-party computation overlays.
Collaborative Learning with Privacy: Efficient communication protocols exchange only low-rank LoRA updates or distilled signal aggregators (trust weights, logits), preserving data locality and minimizing exposure in collaborative or federated settings (Wagner et al., 2024, Fang et al., 31 Jan 2025).
5. Hardware and Software System Integration
5.1 Hardware-Accelerated Fine-Tuning
TrainDeeploy (Wang et al., 10 Mar 2026) demonstrates the first end-to-end on-device transformer/CNN fine-tuning pipeline on RISC-V SoCs, while PockEngine (Zhu et al., 2023) compiles full training graphs to diverse mobile/edge backends. Integer-only pipelines (GSQ-Tuning (Zhou et al., 18 Feb 2025)) and quantization techniques (OTARo (Chen et al., 17 Nov 2025)) enable further deployment on devices with only integer ALUs, with substantial area and power savings vs. FP8.
5.2 Inference-Only Runtime Adaptation
Recent systems (Gao et al., 2024) show that zeroth-order optimization over LoRA-augmented parameters can be executed using unmodified inference-only runtimes such as ExecuTorch, leveraging dual-forwarding modules and parallelized randomized gradient estimation for tractable real-time adaptation without runtime code changes. This unlocks on-device fine-tuning with production inference engines, with substantial memory and speed improvements over traditional backpropagation.
6. Practical Trade-offs, Performance, and Deployment Guidelines
- Memory vs. Compute vs. Accuracy: Memory-saving strategies (MeSP (Park et al., 13 Feb 2026), sparse BP (Zhu et al., 2023), TT-decomposition (Kwak et al., 5 Nov 2025), group-shared exponent quantization (Zhou et al., 18 Feb 2025)) typically incur mild to moderate compute overhead and may introduce small accuracy drops; over-tuning or aggressive parameter sparsity can exacerbate performance loss.
- Federated and Split Setting: Adaptive resource allocation (rank selection, sketching ratio, compute split) enables deployment tailoring across device heterogeneity (Fang et al., 31 Jan 2025, Yuan et al., 14 Aug 2025). Communication-efficient protocols and compressed update routing mitigate uplink bottlenecks.
- Privacy and Differential Privacy: Tuning privacy budgets ($\epsilon$, $\delta$), clipping norms, and noise scales is essential for balancing utility and formal guarantees (Xu et al., 11 Sep 2025). Centralizing rank randomization and enabling secure aggregation are required for DP under dynamic PEFT (Xu et al., 2024).
- Caching and Alignment: Activation caching and skip-forward/cached computation (Skip2-LoRA (Matsutani et al., 2024), PAE MobiLLM (Yang et al., 1 Jul 2025)) amortize device-side compute and accelerate repeated-epoch scenarios. KL-based regularization and trust-weighted aggregation sustain personalization and global knowledge (Wagner et al., 2024, Yuan et al., 14 Aug 2025).
- Quantization Robustness: Quantization methods with block-shared exponents and multi-precision tuning (OTARo (Chen et al., 17 Nov 2025)) enable flexible precision switching and robustness to quantization-induced loss in downstream deployment.
7. Future Directions and Open Questions
- End-to-End Quantization and Mixed-Precision: Extension of memory/computation-efficient PEFT and split architectures to uniformly quantized backbones (e.g., INT4, SEFP) across all layers and adapters remains an active area (Li et al., 27 Feb 2025, Chen et al., 17 Nov 2025).
- Dynamic Adaptation to Environment and Hardware: Real-time rank scheduling, sketch ratio adaptation, and hardware-in-the-loop profiling for optimal split/aggregation strategies across devices with dynamically varying conditions are underdeveloped.
- Full Integer and Nonlinear Kernel Support: Even with integer-only propagation and update (GSQ-Tuning), LayerNorm/Softmax remain limited to BF16/floating point; integrated non-linear integer kernels are a major challenge (Zhou et al., 18 Feb 2025).
- Convergence and Generalization Theory: Block coordinate and selective adaptation methods (LCSB, HetLoRA, FSLoRA) are underpinned by nonconvex optimization theory but lack tight, task- and heterogeneity-aware generalization characterizations.
The field of on-device fine-tuning continues to advance toward practical, privacy-enhanced, and resource-adaptive adaptation pipelines for LLMs and DNNs. Progress is marked by the interplay of algorithmic reduction in trainable/communicated state, system and hardware co-design, parallelization of non-traditional optimization methods, and the safeguarding of user privacy and device autonomy (Li et al., 27 Feb 2025, Xu et al., 11 Sep 2025, Zhu et al., 2023, Yang et al., 1 Jul 2025).