FedSGT: Federated Sequential Group Training
- FedSGT is a federated learning framework that partitions clients into groups and uses sequential chain updates to improve convergence and personalization.
- It employs diverse grouping strategies such as clustering and random slicing to address non-i.i.d. challenges and reduce communication rounds.
- Empirical results show accelerated training, enhanced model accuracy, and precise unlearning with minimal retraining overhead.
Federated Sequential Group-based Training (FedSGT) is a class of federated learning (FL) frameworks that enhance scalability, heterogeneity robustness, group personalization, and data unlearning through the formation of client groups and sequential intra-group update protocols. The key paradigm is to partition clients (or data slices) into “groups” and coordinate training with sequential, chain-based updates within each group, followed by (possibly weighted) aggregation of group-level updates. FedSGT instances have been developed for acceleration on non-i.i.d. data, group-based personalization, robust handling of client heterogeneity, and—most recently—exact federated unlearning with minimal retraining overhead (Zaccone et al., 2022, Liu et al., 2022, Zeng et al., 2022, Zhang et al., 28 Nov 2025).
1. Core Algorithmic Principle
FedSGT frameworks operate through three essential stages:
- Group Formation: Clients (or their data partitions) are clustered into groups (“superclients” or “clusters”) according to data homogeneity, client similarity, or uniform/random assignment depending on the target application. Various strategies exist, including k-means clustering, greedy anti-clustering, centroid equivalence grouping, and random slicing.
- Sequential Group-based Training: Within each group, model parameters are passed sequentially (chain fashion) through the group’s clients or slices, with each local client or slice performing a fixed number of local update epochs. Results are propagated along the chain, with the group’s terminal update representing its aggregate contribution.
- Global Aggregation: The server aggregates the updated models from all (or a subset of) groups, typically via weighted averaging based on group data size or client participation. This is repeated for multiple communication rounds or model permutations, depending on the FedSGT variant.
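The three stages above can be condensed into a short sketch of one communication round. The code below is a minimal illustration, assuming a caller-supplied `local_train(params, client, epochs)` routine that returns updated parameters and the client's sample count; it is not the reference implementation of any cited system.

```python
import copy
import numpy as np

def fedsgt_round(global_params, groups, local_train, epochs=1):
    """One FedSGT communication round (illustrative sketch).

    global_params : dict[str, np.ndarray]  -- current server model
    groups        : list[list]             -- clients per group (grouping stage output)
    local_train   : callable(params, client, epochs) -> (params, n_samples)
    """
    group_models, group_sizes = [], []

    for group in groups:
        # Sequential intra-group chain: each client continues from the
        # parameters produced by the previous client in the chain.
        params = copy.deepcopy(global_params)
        n_group = 0
        for client in group:
            params, n_k = local_train(params, client, epochs)
            n_group += n_k
        group_models.append(params)   # the chain's terminal update
        group_sizes.append(n_group)

    # Global aggregation: data-size-weighted average of the group models.
    total = float(sum(group_sizes))
    return {
        name: sum(w * m[name] for w, m in zip(group_sizes, group_models)) / total
        for name in global_params
    }
```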
The sequential intra-group scheduling is motivated by the need to reduce model drift and better simulate the effect of larger, more i.i.d. mini-batches, particularly in the presence of severe data heterogeneity. In certain instantiations (e.g., for unlearning), sequential training chains are combined with lightweight modularization (PEFT adapters) and multiple, permuted group orders to boost both utility and deletion robustness (Zhang et al., 28 Nov 2025).
2. Formalization and Mathematical Objectives
The standard federated objective in FL is

$$\min_{w} \; F(w) \;=\; \sum_{k=1}^{K} p_k \, F_k(w),$$

where $F_k$ is the local loss at client $k$ and $p_k$ is the data-weighted importance (typically $p_k = n_k / \sum_j n_j$).
FedSGT achieves this through a structured update:
- For each group $g \in \{1, \dots, G\}$, the group performs a sequential update chain: each client receives the parameter vector from its predecessor, updates it via $E$ local SGD epochs, and passes it on, so the chain's terminal model $w_g$ carries the group's aggregate contribution.
- The global parameter is formed by weighted aggregation of the group models:

$$w \;\leftarrow\; \sum_{g=1}^{G} \frac{n_g}{n} \, w_g,$$

where $n_g$ is the number of samples held by group $g$ and $n = \sum_g n_g$.
- In personalized/group FL, a three-level optimization (global → group → local) is employed:
- Stage I (global): $w^{\star} = \arg\min_{w} \sum_{k} p_k F_k(w)$, solved by federated averaging over all clients.
- Stage II (groupwise): $w_g^{\star} = \arg\min_{w_g} \sum_{k \in g} p_k F_k(w_g)$, initialized from $w^{\star}$ and solved within each group.
- Stage III (client-specific): $w_k^{\star} = \arg\min_{w_k} F_k(w_k)$, fine-tuned locally from the group model $w_g^{\star}$ (Liu et al., 2022).
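A condensed sketch of this three-stage refinement is given below, using hypothetical `fedavg` and `finetune` helpers; it illustrates the global → group → local structure rather than the exact procedure of Liu et al. (2022).

```python
def three_level_personalization(init_params, groups, fedavg, finetune,
                                global_rounds=50, group_rounds=10):
    """Global -> group -> local refinement (illustrative sketch)."""
    # Stage I: one global model via federated averaging over all clients.
    all_clients = [c for g in groups for c in g]
    w_global = fedavg(init_params, all_clients, rounds=global_rounds)

    # Stage II: one model per group, initialized from the global model and
    # refined by federated averaging restricted to that group's clients.
    w_group = {gid: fedavg(w_global, group, rounds=group_rounds)
               for gid, group in enumerate(groups)}

    # Stage III: client-specific models, fine-tuned locally from the group model.
    w_client = {client: finetune(w_group[gid], client)
                for gid, group in enumerate(groups) for client in group}
    return w_global, w_group, w_client
```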
For unlearning, the group sequence and PEFT modules form a compositional model: deactivating the adapters of a deleted group, together with those of all downstream groups in each permutation, instantaneously achieves exact unlearning of that group's data (Zhang et al., 28 Nov 2025).
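A minimal sketch of this compositional view is shown below: the model served after a deletion is the frozen base plus only those adapters trained before the deleted group in a given chain. The `Adapter.apply` hook and the truncation rule are illustrative assumptions, not the exact construction of Zhang et al. (28 Nov 2025).

```python
def surviving_adapters(permutation, adapters, deleted_groups):
    """Return the adapters that remain valid in one chain after deletions.

    permutation    : list of group ids in this chain's training order
    adapters       : dict mapping group id -> adapter trained along this chain
    deleted_groups : set of group ids whose data must be unlearned
    """
    kept = []
    for gid in permutation:
        if gid in deleted_groups:
            # Every adapter trained after this point has seen the deleted
            # group's influence, so the chain is truncated here.
            break
        kept.append(adapters[gid])
    return kept

def compose_model(base_model, kept_adapters):
    """Attach only the retained adapters to the frozen base model. The result
    equals a model trained on the surviving prefix alone, so no retraining is
    needed to realize the deletion."""
    for adapter in kept_adapters:
        base_model = adapter.apply(base_model)   # hypothetical PEFT attach hook
    return base_model
```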
3. Grouping Strategies and Protocols
FedSGT groupings are designed to minimize inter-group distributional skew, ensure homogeneity, or satisfy random coverage for exact unlearning. Notable approaches:
- Distributional Summaries: Use classifier weights or confidence vectors; pairwise distance metrics (e.g., KL divergence, cosine) guide the clustering (FedSeq) (Zaccone et al., 2022).
- Centroid Equivalence Theorem: Groups formed by “round-robin” selection from equal-size clusters minimize group divergence (FedGSP) (Zeng et al., 2022).
- Random Slicing: For exact unlearning, data is uniformly split into atomic slices and assigned to groups by permuting slice indices (FedSGT for unlearning) (Zhang et al., 28 Nov 2025).
Client participation constraints and group size are carefully balanced: limits such as a maximum number of clients per group or a minimum group sample size control per-round overhead and client delay; a greedy grouping sketch is given below.
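As an illustration of the distributional-summary strategy, the sketch below greedily assigns each client to the group whose pooled confidence vector would move closest to uniform after the addition, subject to a maximum group size. The scoring rule and the per-client `confidences` representation are assumptions for illustration, not the exact FedSeq procedure.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two (unnormalized) confidence vectors."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def greedy_anticluster(confidences, num_groups, max_size):
    """Greedy anti-clustering sketch: spread dissimilar clients across groups
    so that each group's pooled confidence vector approaches uniform.

    confidences : dict[client_id, np.ndarray]  -- per-client class-confidence vector
    """
    dim = len(next(iter(confidences.values())))
    groups = [[] for _ in range(num_groups)]
    sums = [np.zeros(dim) for _ in range(num_groups)]
    uniform = np.ones(dim) / dim

    for cid, conf in confidences.items():
        best, best_score = None, None
        for g in range(num_groups):
            if len(groups[g]) >= max_size:
                continue
            # Symmetric KL between the would-be pooled vector and uniform.
            pooled = sums[g] + conf
            score = kl(pooled, uniform) + kl(uniform, pooled)
            if best_score is None or score < best_score:
                best, best_score = g, score
        if best is None:
            raise ValueError("num_groups * max_size is too small for all clients")
        groups[best].append(cid)
        sums[best] += conf
    return groups
```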
4. Protocol Instantiations and Empirical Results
Different FedSGT variants are designed for classic acceleration, personalization, robust non-i.i.d. learning, and data deletion:
| Variant | Grouping Method | Intra-Group Updates | Aggregation | Notable Results / Claims | Reference |
|---|---|---|---|---|---|
| FedSeq | Pretrained confidence + KL | Sequential chain | Weighted average | Faster convergence; accuracy parity with SOTA | (Zaccone et al., 2022) |
| GroupPerFL | Metadata/clustering | Groupwise FedAvg/FedAdam | 3-stage loop | Lower PPL than PerFL ($\geq 2\times$); reduced compute | (Liu et al., 2022) |
| FedGSP | ICG, centroid-based | Sequential-to-parallel | Dynamic aggregation | Accuracy gains; fewer communication rounds | (Zeng et al., 2022) |
| FedSGT (unlearning) | Uniform random slicing | Chain with PEFT adapters | Permutation-based | More deletions tolerated; overhead from $\approx 1.1\times$ | (Zhang et al., 28 Nov 2025) |
Empirical results demonstrate that FedSGT consistently lowers the number of required communication rounds, boosts personalization performance (especially when per-client data is scarce), and enables efficient unlearning at scale.
5. Theoretical Properties
Formal convergence analysis for classic FedSGT is derived from standard FedAvg and SGD chaining arguments:
- Convergence Rate: For smooth, bounded-variance objectives, sequential chaining within groups preserves a FedAvg-style $\mathcal{O}(1/\sqrt{T})$ convergence rate over $T$ communication rounds (Zaccone et al., 2022, Zeng et al., 2022).
- Deletion Robustness: The number of deletions that can be absorbed without retraining grows with both the group count $G$ and the sequence (permutation) budget $S$, by a coupon-collector argument (Zhang et al., 28 Nov 2025).
- Statistical Equivalence: For exact unlearning, truncated models after group module deactivation are distributionally identical to retraining from scratch on the retained data (Zhang et al., 28 Nov 2025).
- Bayesian Hierarchy: In personalized/group FL, the 3-level groupwise update is interpretable as MAP inference in a hierarchical Gaussian prior with variance decreasing at each aggregation stage (Liu et al., 2022).
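One way to make the last point concrete is the hierarchical-prior sketch below; the notation is assumed for illustration rather than taken from Liu et al. (2022).

```latex
% Hierarchical Gaussian prior (illustrative notation): client parameters scatter
% around their group mean, group means around the global mean, with variance
% shrinking at each level.
w_k \sim \mathcal{N}(w_g, \sigma_\ell^2 I), \qquad
w_g \sim \mathcal{N}(w, \sigma_g^2 I), \qquad
\sigma_\ell^2 < \sigma_g^2 .
% MAP inference at the client level then adds a proximal pull toward the group
% model, mirroring Stage III of the three-level update:
\hat{w}_k = \arg\min_{w_k}\; F_k(w_k) + \tfrac{1}{2\sigma_\ell^2}\,\lVert w_k - w_g \rVert^2 .
```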
6. Integration with Established Methods
FedSGT is orthogonal to many FL modifications:
- FedAvg: The base server aggregation primitive is retained; only the intra-group updates are changed from parallel local training to sequential, chained training.
- FedProx/FedDyn: Proximal and dynamic regularization integrate directly into each chain step, boosting drift control and convergence speed, especially under non-i.i.d. scenarios (Zaccone et al., 2022).
- PEFT (LoRA, Adapters): For unlearning, FedSGT relies on parameter-efficient modules that localize group-specific contributions, enabling instant deletion and separation of influence (Zhang et al., 28 Nov 2025).
Empirically, hybridizations (e.g., FedSeq+FedDyn) yield additional accelerations and improved final accuracy.
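As an example of the FedProx integration, each local step of the chain can add a proximal pull toward the parameters received from the previous client in the chain. The sketch below uses PyTorch; the coefficient `mu` and the generic `loss_fn` are illustrative assumptions.

```python
import copy
import torch

def proximal_chain_step(model, data_loader, loss_fn, lr=0.01, mu=0.01):
    """One local epoch of a chain step with a FedProx-style proximal term
    pulling the parameters toward the model handed over by the chain."""
    anchor = copy.deepcopy(model)                       # incoming chain parameters
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    for x, y in data_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        # Proximal regularizer: (mu / 2) * ||w - w_anchor||^2
        prox = sum((p - a.detach()).pow(2).sum()
                   for p, a in zip(model.parameters(), anchor.parameters()))
        (loss + 0.5 * mu * prox).backward()
        optimizer.step()
    return model                                        # passed to the next client
```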
7. Limitations and Practical Guidance
Strengths:
- Accelerates convergence under heterogeneity and non-i.i.d. distributions.
- Enables group and client-level personalization.
- Realizes exact, instant unlearning without retraining or downtime.
- Storage and communication costs are modest with PEFT (Zhang et al., 28 Nov 2025).
Limitations:
- Server may need to store lightweight modules for unlearning (Zhang et al., 28 Nov 2025).
- Uniform random grouping in unlearning is not always optimal for privacy-utility trade-off; specialized heuristics may be advantageous.
- Extensions to vertical FL or asynchronous update regimes are non-trivial.
- Full convergence proofs for certain variants remain open.
Recommendations:
- Pre-train local models for reliable distribution summaries prior to grouping.
- Use confidence-vector KL divergence with greedy grouping for strongly non-i.i.d. data; random grouping suffices when data are near-i.i.d.
- Limit chain length to reduce client delay and keep sequential passes short (a small number of local epochs per client is typically sufficient).
- Monitor per-round accuracy and communicate only as necessary; FedSGT variants offer significant round reductions for fixed accuracy targets.
- For unlearning, maximize both the group count and sequence budget within overhead constraints.
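The recommendations above reduce to a small set of tunable knobs. The configuration sketch below only shows their shape; the values are placeholders, not validated settings.

```python
# Illustrative FedSGT configuration (placeholder values, not tuned).
fedsgt_config = {
    "grouping": "kl_greedy",        # or "random" when data are near-i.i.d.
    "num_groups": 10,               # more groups -> more deletions tolerated
    "max_group_size": 8,            # bounds per-round client delay
    "local_epochs": 1,              # keep sequential passes short
    "sequence_budget": 4,           # permuted chains (unlearning variant only)
    "aggregation": "weighted_avg",  # data-size-weighted group aggregation
    "target_rounds": 200,           # stop once the accuracy target is reached
}
```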
FedSGT mechanisms have established themselves as foundational techniques for advanced federated learning, offering robust solutions to the dual challenges of statistical/data heterogeneity and operational efficiency across acceleration, personalization, and privacy-compliant unlearning (Zaccone et al., 2022, Liu et al., 2022, Zeng et al., 2022, Zhang et al., 28 Nov 2025).