Device-Server Co-training: A Collaborative Approach
- Device-server co-training is a collaborative paradigm that enables joint model training on resource-constrained devices and powerful servers without sharing raw data.
- It employs techniques like proxy-data outsourcing, federated SGD with quantization, and split learning to optimize efficiency, scalability, and privacy.
- The approach achieves significant reductions in communication and memory usage while addressing challenges posed by heterogeneous architectures and non-IID data.
Device-Server Co-training is a collaborative paradigm in distributed machine learning that enables resource-constrained devices (clients) and computationally powerful servers (clouds) to jointly train or fine-tune models without full data sharing. The overarching objective is to leverage server-scale compute and public data while preserving device data privacy, reducing memory and communication footprints, and accommodating architectural and domain heterogeneity. Device-server co-training spans methodologies such as collaborative data proxy construction, federated optimization with reduced-precision arithmetic, split learning with hierarchical partitioning, parameter-efficient adapters, and knowledge transfer via distilled proxy models. These regimes facilitate efficient learning on private, heterogeneous, or bandwidth-constrained edge devices.
1. System Architectures and Taxonomy
Device-server co-training architectures can be broadly classified by the locus and division of trainable and frozen parameters, the communication protocols, and the learning objectives:
- Proxy-Data Outsourcing: The device never transmits raw data but communicates abstracted, privacy-preserving signals that enable the server to select or synthesize a proxy training set that approximates the device data distribution (Hong et al., 2022).
- Federated SGD with Quantized Communication: Each device performs local SGD on its data, but only quantized model updates are exchanged with the server, keeping device data local (Wang et al., 2024, Amiri et al., 2021).
- Side-Tuning and Adapters: Devices maintain a frozen backbone and forward activations; parameter-efficient, trainable adapters are located on the server or jointly between device and server (Li et al., 27 Feb 2025, Liu et al., 12 Nov 2025, Zhang et al., 23 Jan 2025, Ma et al., 23 Mar 2025).
- Split and Hierarchical Split Learning: The network is partitioned between device and server (and possibly intermediate edge nodes). Forward activations are transferred sequentially, and only a limited part of the network is updated (e.g. LoRA adapters) (Zhang et al., 23 Jan 2025, Ma et al., 23 Mar 2025, Wang et al., 23 Nov 2025).
- Collaborative Distillation via Proxies: Knowledge from large server models is distilled into device models or lightweight proxy models, which serve as bridges for collaborative training under heterogeneous architectures (Liu et al., 12 Nov 2025, Ding et al., 2023).
This architectural diversity addresses trade-offs between device capacity, server resources, privacy guarantees, network latency, and adaptation fidelity.
2. Representative Methodologies
a) Efficient Collaborative Open-Source Sampling (ECOS)
In ECOS, the server possesses a large, open-source dataset and compresses it into a small set of feature centroids using k-means. The client computes centroid coverage statistics (with differential privacy noise) on its private data and sends these scalars to the server. Using this feedback, the server selects a diverse and proximal subset of open-source samples as a proxy set for training. This approach minimizes uplink bandwidth and never requires transfer of raw or feature-level private data (Hong et al., 2022).
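The exchange described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the ECOS implementation: the centroids are assumed to have been computed already, features are plain tuples, `sigma` is an uncalibrated stand-in for the differential-privacy noise scale, and the server allocates its sampling budget in proportion to the noisy coverage without the diversity step. All function names are hypothetical.

```python
import random

def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))

def client_coverage(private_feats, centroids, sigma, rng):
    """Client side: noisy count of private samples assigned to each centroid."""
    counts = [0.0] * len(centroids)
    for f in private_feats:
        counts[nearest(f, centroids)] += 1.0
    # Assumption: plain Gaussian noise stands in for a calibrated DP mechanism.
    return [max(0.0, c + rng.gauss(0.0, sigma)) for c in counts]

def server_select(public_feats, centroids, noisy_counts, budget):
    """Server side: split the sampling budget across clusters by noisy coverage."""
    total = sum(noisy_counts) or 1.0
    clusters = {k: [] for k in range(len(centroids))}
    for i, f in enumerate(public_feats):
        clusters[nearest(f, centroids)].append(i)
    chosen = []
    for k, members in clusters.items():
        quota = round(budget * noisy_counts[k] / total)
        chosen.extend(members[:quota])
    return chosen[:budget]

# Hypothetical toy data: the private samples sit near the first centroid.
rng = random.Random(0)
centroids = [(0.0, 0.0), (10.0, 10.0)]
private = [(0.1, -0.2), (0.3, 0.1), (-0.1, 0.0)]
public = [(0.0, 0.5), (9.0, 10.0), (0.2, 0.2), (11.0, 9.5), (0.4, -0.3)]
uplink = client_coverage(private, centroids, sigma=0.5, rng=rng)  # one scalar per centroid
proxy_indices = server_select(public, centroids, uplink, budget=2)
```

The only message sent uplink is `uplink`, a list with one number per centroid, which is what keeps the client's bandwidth cost independent of its dataset size in this sketch.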
b) Federated and Quantized Update Aggregation
Federated Learning variants select a subset of devices for each communication round based on downlink channel conditions, transmit quantized model parameters, and perform local updates under stochastic gradient descent (SGD). Optimization occurs under capacity and quantization constraints, with trade-offs between statistical and communication error (Amiri et al., 2021). FP8-based federated learning reduces device and communication cost by executing training and weight aggregation in 8-bit floating point arithmetic while maintaining an FP32 server model; unbiased quantization and clipping schemes are used to achieve low communication overhead with minimal impact on convergence (Wang et al., 2024).
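To make the idea of unbiased quantization with clipping concrete, the sketch below uses stochastic rounding onto a uniform 8-bit grid. It is a stand-in chosen for simplicity: it does not emulate an actual FP8 format, and the clipping threshold `clip`, the grid size `levels`, and all function names are assumptions introduced for illustration.

```python
import math
import random

def stochastic_quantize(values, clip, levels, rng):
    """Clip to [-clip, clip], then round up or down at random so that the
    expected dequantized value equals the clipped input (unbiased rounding)."""
    step = 2.0 * clip / (levels - 1)
    codes = []
    for v in values:
        v = max(-clip, min(clip, v))
        pos = (v + clip) / step            # position on the grid, in units of step
        low = math.floor(pos)
        frac = pos - low                   # probability of rounding up
        idx = low + (1 if rng.random() < frac else 0)
        codes.append(min(idx, levels - 1)) # integer code; fits in 8 bits if levels <= 256
    return codes, step

def dequantize(codes, clip, step):
    """Map integer codes back to real values on the grid."""
    return [c * step - clip for c in codes]

def aggregate(client_updates, clip, levels, rng):
    """Server averages the dequantized client updates in full precision."""
    n = len(client_updates)
    total = [0.0] * len(client_updates[0])
    for update in client_updates:
        codes, step = stochastic_quantize(update, clip, levels, rng)
        for j, v in enumerate(dequantize(codes, clip, step)):
            total[j] += v / n
    return total

rng = random.Random(1)
# Hypothetical round: 5000 clients all report the same one-dimensional update.
mean_update = aggregate([[0.3]] * 5000, clip=1.0, levels=256, rng=rng)
```

Because each rounding error has zero mean, the errors tend to cancel when many clients are averaged, which is the property that lets the server keep a higher-precision model while clients exchange 8-bit codes.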
c) Split Learning and Hierarchical Partitioning
Hierarchical split learning partitions both the backbone network and trainable LoRA adapters across user, edge, and cloud tiers. Forward and backward propagation stages are pipelined across these tiers; only activations (“smashed data”) and low-rank gradients are communicated, and peak device memory is minimized (Zhang et al., 23 Jan 2025, Wang et al., 23 Nov 2025, Ma et al., 23 Mar 2025). CycleSL further improves split learning by pooling all client features on the server, performing multiple epochs of server optimization (“block coordinate descent”), and cycling back gradient updates, thus stabilizing convergence under non-IID data (Wang et al., 23 Nov 2025).
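The basic message pattern of split learning can be illustrated with a two-tier toy model. The sketch below is deliberately simplified and is not any of the cited systems: it has no edge tier, no LoRA adapters, and no pipelining, and it uses a tiny linear network with a squared-error loss. All names and numbers are hypothetical.

```python
def device_forward(W1, x):
    """Device-side layer: returns the cut-layer activations ("smashed data")."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W1]

def server_step(w2, h, target, lr):
    """Server-side layer: finish the forward pass, update w2, return dL/dh."""
    y = sum(w * hi for w, hi in zip(w2, h))
    err = y - target                        # derivative of 0.5 * (y - target)**2 w.r.t. y
    grad_h = [err * w for w in w2]          # uses the weights from before the update
    new_w2 = [w - lr * err * hi for w, hi in zip(w2, h)]
    return new_w2, grad_h, 0.5 * err ** 2

def device_backward(W1, x, grad_h, lr):
    """Device finishes backpropagation from the gradient sent by the server."""
    return [[w - lr * g * xi for w, xi in zip(row, x)] for row, g in zip(W1, grad_h)]

def train(W1, w2, data, lr=0.05, epochs=200):
    losses = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for x, t in data:
            h = device_forward(W1, x)                # uplink: activations only
            w2, grad_h, loss = server_step(w2, h, t, lr)
            W1 = device_backward(W1, x, grad_h, lr)  # downlink: cut-layer gradient only
            epoch_loss += loss
        losses.append(epoch_loss)
    return W1, w2, losses

# Hypothetical data following t = x0 + 2 * x1.
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 3.0)]
W1, w2, losses = train([[0.5, 0.1], [0.2, 0.4]], [0.3, 0.6], data)
```

Neither the raw input `x` nor the target-side weights cross the network in this sketch; only `h` and `grad_h` do, which mirrors the activation and gradient exchange described above.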
d) Adapter-Based and Side-Tuning Protocols
Server-assisted side-tuning frameworks decouple trainable adapters or side-networks from the backbone, enabling devices with frozen local backbones to forward quantized activations to the server, which handles backpropagation and updates. This strongly reduces device memory, as only forward-pass buffers and sample activations are ever retained on-device (Li et al., 27 Feb 2025). Structure-agnostic co-tuning with distilled proxy models establishes a bridge for model-to-model knowledge transfer in heterogeneous cloud-edge LLM/SLM ensembles, with mutual learning and domain-aware adapter fine-tuning (Liu et al., 12 Nov 2025).
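A minimal sketch of the side-tuning pattern follows, under strong simplifying assumptions: the frozen backbone is a single ReLU layer, the trainable side network is a linear head fed by the final activations only, and activations are quantized with a fixed, hypothetical `scale`. It is meant to show where computation happens, not to reproduce any cited framework.

```python
def device_forward_frozen(backbone, x, scale):
    """Device: forward pass through frozen weights, then 8-bit quantization.
    No gradients or optimizer state are kept on the device."""
    acts = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in backbone]  # ReLU
    return [max(-128, min(127, round(a / scale))) for a in acts]

def server_train_side(head, codes_batch, targets, scale, lr, epochs):
    """Server: all gradient computation happens here, on the small trainable head."""
    for _ in range(epochs):
        for codes, t in zip(codes_batch, targets):
            acts = [c * scale for c in codes]        # dequantize received activations
            err = sum(w * a for w, a in zip(head, acts)) - t
            head = [w - lr * err * a for w, a in zip(head, acts)]
    return head

# Hypothetical setup: identity backbone, targets following t = x0 + 2 * x1.
backbone = [[1.0, 0.0], [0.0, 1.0]]
xs = [[1.0, 0.5], [0.5, 2.0], [2.0, 1.0]]
targets = [2.0, 4.5, 4.0]
codes = [device_forward_frozen(backbone, x, scale=0.25) for x in xs]
head = server_train_side([0.0, 0.0], codes, targets, scale=0.25, lr=0.05, epochs=500)
```

The device-side function never receives a gradient, so its memory footprint in this sketch is bounded by the forward pass and the quantized activations it sends.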
e) Distillation-Informed Control for Co-optimization
DC-CCL splits vision models vertically, with a large cloud-side model and a small device co-submodel. A distillation-trained control (“mimic”) model is included on the device to provide surrogate gradients, aligning local updates with the global model's direction without transmitting large model weights or raw device data (Ding et al., 2023).
3. Optimization Objectives, Communication Protocols, and Privacy
Optimization in device-server co-training is typically formulated to balance proximity to the device data distribution, model diversity, and privacy leakage:
- Proxy Selection Objectives: ECOS explicitly minimizes the distance between the proxy set and client data in feature space, subject to diversity and differential privacy constraints (Hong et al., 2022).
- Loss Functions: Server-side losses may be supervised, semi-supervised (e.g. FixMatch), or knowledge-distillation-based, depending on the availability of labels and scenario (Hong et al., 2022, Ding et al., 2023).
- Communication Reduction: Nearly all protocols use some form of compression (quantization, centroids, low-bit activations, or lightweight adapter updates) to reduce round-trip bandwidth. For example, FP8FedAvg achieves a 2.9–9.5× reduction in communication load over FP32 (Wang et al., 2024), and Co-PLMs transmits only 0.02% of total model parameters per round (Liu et al., 12 Nov 2025).
- Differential Privacy and Data Minimization: Client-side features or coverage vectors are perturbed with calibrated Gaussian noise, and only lightweight statistics or representations are ever communicated (Hong et al., 2022).
- Pipeline and Overlapping Execution: By freezing device-side parameters and decoupling backpropagation, schemes such as MobiLLM and SplitFrozen overlap device forward passes, communication, and server-side optimization to minimize idle time (Li et al., 27 Feb 2025, Ma et al., 23 Mar 2025).
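The effect of bit width and of sending only a fraction of the parameters can be checked with simple arithmetic. The sketch below assumes a hypothetical model with one billion parameters and counts bytes per message only; it does not account for how many rounds a method needs, so it should not be read as reproducing any reported figure.

```python
def payload_bytes(num_params, bits_per_value, fraction_sent=1.0):
    """Bytes needed to send `fraction_sent` of the parameters at a given bit width."""
    return num_params * fraction_sent * bits_per_value / 8

NUM_PARAMS = 1_000_000_000  # hypothetical model size, chosen for round numbers

full_fp32 = payload_bytes(NUM_PARAMS, 32)                        # whole model, 32-bit
full_8bit = payload_bytes(NUM_PARAMS, 8)                         # whole model, 8-bit
adapters = payload_bytes(NUM_PARAMS, 32, fraction_sent=0.0002)   # 0.02% of parameters

print(f"full FP32 : {full_fp32 / 1e9:.2f} GB")
print(f"full 8-bit: {full_8bit / 1e9:.2f} GB")
print(f"0.02% FP32: {adapters / 1e6:.2f} MB")
```

Under these assumptions, moving from 32-bit to 8-bit values shrinks each message by a factor of four, while sending 0.02% of the parameters shrinks it from gigabytes to under a megabyte.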
4. Empirical Evaluation and Trade-Offs
Empirical studies across vision and language tasks highlight advantages in communication, accuracy, compute savings, and robustness to heterogeneity:
| Framework | Efficiency Gain | Accuracy Change | Key Mechanism | Reference |
|---|---|---|---|---|
| ECOS | ≫10× vs raw data | +0.9–5.4 pts | DP-proximal, diverse proxy samples | (Hong et al., 2022) |
| FP8FedAvg-UQ | 2.9–9.5× | ±1–3 % | Unbiased FP8 quantization | (Wang et al., 2024) |
| MobiLLM | 4× memory saved | Unchanged | Frozen backbone, server adapters | (Li et al., 27 Feb 2025) |
| SplitLLM | ≤74 % memory↓ | Comparable | User/edge/cloud model splitting | (Zhang et al., 23 Jan 2025) |
| SplitFrozen | 86.8 % FLOPs↓ | +69.4 % under Non-IID | Device-side frozen layers | (Ma et al., 23 Mar 2025) |
| CycleSL | O(1) server memory | +5–30 pts | Cyclical server-client updates | (Wang et al., 23 Nov 2025) |
| Co-PLMs | 0.02 % parameters per round | +5.38 % Rouge-L | DPM bridges for structure-agnostic learning | (Liu et al., 12 Nov 2025) |
Thus, device-server co-training protocols achieve high communication/compute efficiency and scalability, often matching or approaching full centralized fine-tuning accuracy. However, trade-offs exist: aggressive compression or shallow device splits can elevate error; heterogeneity of device data or hardware can complicate load balancing.
5. Design Challenges, Assumptions, and Limitations
Key design points include:
- Data Distribution and Overlap: ECOS and proxy-based schemes rely on sufficient overlap between server-side open-source data and the private device distribution for effective sampling (Hong et al., 2022).
- Feature Extractor Suitability: Many protocols assume access to a fixed extractor that produces semantically meaningful embeddings across data sources (Hong et al., 2022).
- Hardware Constraints: FP8 training and side-tuning gains are maximized with native hardware support, which is not universally available (Wang et al., 2024).
- Model Partitioning Heuristics: Split methods must select split points and frozen-layer depth considering device capacity, communication cost, and accuracy loss; poor choices can result in suboptimal convergence or bottlenecks (Ma et al., 23 Mar 2025).
- Privacy/Utility Trade-offs: Privacy-preserving additions (differential privacy noise, lossy quantization) may moderately degrade ranking or coverage quality in extreme low-data regimes (Hong et al., 2022).
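As a rough illustration of the partitioning trade-off listed above, the sketch below picks a split point from per-layer memory and activation sizes. It is a hypothetical heuristic, not one taken from the cited work: it considers only device memory and uplink size, ignores accuracy and latency, and all numbers are invented.

```python
def choose_split(layer_mem_mb, act_size_mb, device_budget_mb):
    """Pick the cut index k (device runs layers[:k]) that fits the memory budget
    and emits the smallest activation. Returns (k, activation_mb) or None."""
    best = None
    used = 0.0
    for k in range(1, len(layer_mem_mb) + 1):
        used += layer_mem_mb[k - 1]
        if used > device_budget_mb:
            break                          # deeper cuts no longer fit on the device
        cost = act_size_mb[k - 1]          # activation emitted by the last device layer
        if best is None or cost < best[1]:
            best = (k, cost)
    return best

# Hypothetical profile: memory per layer and activation size after each layer.
layer_mem_mb = [100, 200, 300, 400]
act_size_mb = [8, 4, 2, 1]
split = choose_split(layer_mem_mb, act_size_mb, device_budget_mb=650)
```

With this profile the device can hold the first three layers, and cutting after the third sends the smallest activation that still fits; a tighter budget forces a shallower cut with a larger uplink payload.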
Future directions identified include multi-round co-training and iterative proxy sampling (Hong et al., 2022), learnable or adaptive feature extractors, extensions beyond vision to NLP or tabular data (Hong et al., 2022, Liu et al., 12 Nov 2025), and integrated augmentation with federated learning (Hong et al., 2022).
6. Comparative Frameworks and Empirical Benchmarks
Device-server co-training is empirically validated on image classification (MNIST, CIFAR, DomainNet, FEMNIST, CelebA, ImageNet), question answering (SNI, MMLU), sequence-to-sequence tasks, and content generation (GSM8K), using both IID and strongly non-IID splits (Hong et al., 2022, Wang et al., 2024, Wang et al., 23 Nov 2025, Liu et al., 12 Nov 2025, Ma et al., 23 Mar 2025). The leading methodologies achieve superior trade-offs compared to:
- Centralized or federated updates with full-precision or all-layer updates (FedLoRA, vanilla FL)
- Parallel and sequential split learning without feature resampling or aggregation-free updates (PSL, SFL, SGLR)
- Purely local fine-tuning or cloud-only training without device data incorporation
CycleSL improves accuracy by up to 30 points on non-IID CelebA and FEMNIST over classical PSL/SGLR baselines (Wang et al., 23 Nov 2025). SplitFrozen improves accuracy under extreme label skew by up to 69.4% and reduces device FLOPs by 86.8% (Ma et al., 23 Mar 2025). MobiLLM and SplitLLM bring LLM fine-tuning to commodity or CPU-only devices with memory below 4.5 GB (Li et al., 27 Feb 2025, Zhang et al., 23 Jan 2025).
7. Outlook and Continuing Challenges
Device-server co-training is evolving toward greater heterogeneity, scalability, and privacy in distributed ML. The integration of distilled proxies and adapters for structure-agnostic mutual learning shows promise for unifying large (server) and small (device) models in real-world consortia (Liu et al., 12 Nov 2025). Extensions to domain-adaptive NLP, pipeline parallelism to mask uplink latency (Li et al., 27 Feb 2025), and block coordinate optimization to minimize client drift (Wang et al., 23 Nov 2025) further advance the field.
Key open questions include optimal partitioning strategies under hardware variability, joint optimization of feature extractors, generalization of co-training protocols to multi-modal or online learning, and the theoretical limits of privacy-utility-communication trade-offs. Emerging standards, adaptive aggregation, and principled proxy/data selection are likely to remain central themes as deployment scales and edge clouds mature.