
Skewness-Guided Pruning in Swin Transformers

Updated 16 December 2025
  • The paper demonstrates that computing skewness in activation outputs enables selective pruning of uninformative MSA heads and MLP groups while maintaining accuracy.
  • The method strategically reduces model complexity, replacing entire blocks with identity mappings when all of their components are pruned, and cuts federated communication costs by roughly 64%.
  • The approach integrates into a federated learning framework using public server data, ensuring privacy and efficient edge deployment for skin lesion classification.

A skewness-guided pruning method is a model compression and structure-adaptation technique applicable to multimodal Swin Transformers, especially in the context of federated skin lesion classification on edge devices. This approach leverages the statistical skewness of output distributions in Multi-Head Self-Attention (MSA) and Multi-Layer Perceptron (MLP) layers to selectively remove uninformative heads and neuron groups, thereby substantially reducing model complexity while maintaining predictive performance. The method is validated and deployed in a horizontal federated learning (FL) framework, achieving significant reductions in resource demands and communication overhead, which are critical for practical edge deployment under strict privacy and compute constraints (Paxton et al., 9 Dec 2025).

1. Mathematical Foundation: Skewness in Activation Distributions

Skewness, specifically the Fisher–Pearson standardized moment coefficient, quantifies the asymmetry of a real-valued distribution. For a sample $\{x_1, \dots, x_N\}$ representing norms of feature vectors, the sample mean $\mu$ and variance $\sigma^2$ are

$$\mu = \frac{1}{N} \sum_{i=1}^N x_i, \qquad \sigma^2 = \frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2.$$

The skewness is defined as

$$\text{skew}(x) = \frac{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^3}{\left(\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2\right)^{3/2}} = \frac{\frac{1}{N} \sum_{i} (x_i - \mu)^3}{\sigma^3}.$$

In the skewness-guided pruning context, the $x_i$ are chosen as the $\ell_2$-norms of the outputs of MSA heads or of groups of expanded neurons in the MLP.

A positive skewness in these outputs is interpreted as an indicator of functional specificity—such as “attending to salient (lesion) tokens”—while non-positive skewness is interpreted as the absence of informative or discriminative representation contribution (Paxton et al., 9 Dec 2025).
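A minimal sketch of this statistic in PyTorch is shown below. The tensor shape (batch, tokens, heads, head dim) and all variable names are illustrative assumptions for this sketch, not the paper's code.

```python
import torch


def fisher_pearson_skewness(x: torch.Tensor) -> float:
    """Fisher-Pearson standardized third moment of a 1-D sample (biased 1/N normalization)."""
    mu = x.mean()
    centered = x - mu
    sigma = centered.pow(2).mean().sqrt()
    return (centered.pow(3).mean() / sigma.pow(3)).item()


# Illustrative use: per-head skewness of l2-norms of MSA outputs.
# Assumed shape (B, T, H, D) = (batch, tokens, heads, head dim).
A = torch.randn(2, 49, 4, 32)
norms = A.norm(dim=-1)                      # (B, T, H): l2-norm of each head's output vector
s_per_head = [fisher_pearson_skewness(norms[0, :, h]) for h in range(norms.shape[-1])]
```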

2. Pruning Protocol: Computation and Application

The skewness-guided pruning protocol involves the following computational steps, illustrated below for both MSA heads and MLP groups (a code sketch follows the list):

  • MSA Head Pruning:
    • Compute, for each head $h$, the $\ell_2$-norms of the projected attention output vectors over a batch. For a tensor $A \in \mathbb{R}^{B \times W^2 \times H \times D}$, vectorize the set $v_{i,j}^{(h)} = \|A_{0,i,j,h}\|_2$ over all spatial tokens.
    • Calculate the sample mean $\mu_h$, variance, and skewness $s_h$ using the formulation above.
    • Retain heads with $s_h > 0$; prune those with $s_h \leq 0$.
  • MLP Group Pruning:
    • Partition the expansion (intermediate) layer output $Z \in \mathbb{R}^{B \times W^2 \times C}$ into $G$ groups. For each group $g$, compute $z_{i,j}^{(g)} = \|Z_{0,i,j,g}\|_2$.
    • Compute the group-wise skewness $s_g$ and similarly prune all groups with $s_g \leq 0$.

When all heads or groups within a block are pruned, the procedure replaces the respective MSA block with an identity mapping, $I(Q) = Q$, or collapses the entire MLP to a LayerNorm followed by identity, preserving output dimensionality and residual pathways.
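One way to realize this collapse in PyTorch is sketched below; the paper does not prescribe specific module classes, and the attribute names in the comments are hypothetical.

```python
import torch.nn as nn


class CollapsedMLP(nn.Module):
    """Stand-in for a fully pruned MLP sub-block: LayerNorm followed by identity,
    preserving output dimensionality so the residual connection still applies."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(x)


# A fully pruned MSA sub-block can analogously be swapped for nn.Identity(), e.g.
#   block.attn = nn.Identity()           # 'attn' is a hypothetical attribute name
#   block.mlp  = CollapsedMLP(embed_dim)
```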

3. Integrated Pruning Workflow

The pruning and fine-tuning workflow operates as follows:

```
Input:  Pretrained Swin Transformer M
Output: Pruned & fine-tuned model M*

1.  M' ← copy(M)
2.  For each Swin stage s = 0…S–1:
    2.1 For each block b in stage s:
        a)  Compute s_h for every MSA head h in (s,b) using first-batch norms.
        b)  Identify prune_heads = {h | s_h ≤ 0}.
        c)  Remove those heads from M'; if all heads are pruned, replace the MSA with identity.
        d)  Compute s_g for each MLP group g likewise.
        e)  Remove those groups from the intermediate layer; if all are pruned, collapse the MLP to LayerNorm + identity.
    2.2 Freeze the parameters of stage s; fine-tune the remaining parameters on the target task.
3.  Return M' as the final model M*
```

This procedure preserves output dimensionality so that residual connections remain valid, while block parameters are frozen stage by stage and the remaining parameters are fine-tuned.
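The stage-wise freeze can be expressed roughly as below; the `layers` attribute is an assumption about how the Swin implementation exposes its stages, and the hyperparameters are illustrative.

```python
def freeze_stage(model, stage_idx: int) -> None:
    """Freeze every parameter of one Swin stage after it has been pruned and fine-tuned."""
    for p in model.layers[stage_idx].parameters():   # 'layers' is an assumed attribute name
        p.requires_grad = False


# After step 2.2 for stage s, only the still-trainable parameters are optimized, e.g.:
#   freeze_stage(model, s)
#   trainable = (p for p in model.parameters() if p.requires_grad)
#   optimizer = torch.optim.AdamW(trainable, lr=1e-4)   # illustrative hyperparameters
```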

4. Federated Learning Coordination and Privacy

Within the FL regime, pruning is executed centrally at the server using a public “open” subset (20% of HAM10000) reserved for this purpose. At each round, after FedAvg aggregation:

  • The server applies the skewness-prune operation to the aggregated central model—with no access to private client data.
  • The reduced model and associated mask are then broadcast to all clients.
  • Clients fine-tune locally on their private data splits and share only model weight updates (deltas), not raw data or mask details.

All pruning statistics depend exclusively on public data, addressing privacy constraints intrinsic to horizontal FL (Paxton et al., 9 Dec 2025).
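Schematically, one federated round then looks like the sketch below. The client API, the `prune_by_skewness` helper, and the handling of state_dicts are simplifications and assumptions, not the paper's implementation.

```python
import copy


def fedavg(client_states, client_sizes):
    """Sample-size-weighted average of client state_dicts (standard FedAvg; buffers glossed over)."""
    total = sum(client_sizes)
    return {k: sum(state[k] * (n / total) for state, n in zip(client_states, client_sizes))
            for k in client_states[0]}


def federated_round(server_model, clients, public_loader):
    # 1. Clients fine-tune locally on private data; each returns (state_dict, num_samples).
    results = [c.local_update(copy.deepcopy(server_model)) for c in clients]   # hypothetical client API
    states, sizes = zip(*results)
    # 2. Server aggregates with FedAvg.
    server_model.load_state_dict(fedavg(list(states), list(sizes)))
    # 3. Skewness-guided pruning uses ONLY the public "open" split held by the server.
    prune_by_skewness(server_model, public_loader)       # hypothetical helper (see earlier sketches)
    # 4. The reduced model (and its pruning mask) is broadcast back to every client.
    for c in clients:
        c.receive(copy.deepcopy(server_model))
```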

5. Empirical Evaluation and Results

The method is validated on HAM10000 (10,000 dermoscopic images, 7 classes): pruning uses the public 20% server split, and the remaining data is distributed across six clients, each holding an 80% train / 20% validation split. Hardware emulation targets edge constraints (≤50 MB model size, <2.5 GFLOPS per inference).
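The described partitioning can be reproduced at the index level as in the following sketch; the shuffling, seeding, and equal-size client shards are assumptions of this illustration.

```python
import random


def partition_ham10000(num_images=10_000, num_clients=6, seed=0):
    """Index-level split: 20% public server data, the rest shared across clients (80/20 train/val each)."""
    rng = random.Random(seed)
    idx = list(range(num_images))
    rng.shuffle(idx)

    n_public = int(0.2 * num_images)              # 20% "open" split used for pruning statistics
    public, remaining = idx[:n_public], idx[n_public:]

    clients = []
    per_client = len(remaining) // num_clients
    for c in range(num_clients):
        shard = remaining[c * per_client:(c + 1) * per_client]
        cut = int(0.8 * len(shard))               # 80% train / 20% validation per client
        clients.append({"train": shard[:cut], "val": shard[cut:]})
    return public, clients
```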

Metrics and Results Table

| Setting   | Accuracy (%) | F1-score (%) | Params (M) | Model File (MB) | GFLOPS | Comm. Savings (%) |
|-----------|--------------|--------------|------------|-----------------|--------|-------------------|
| Single    | 85.1         | 76.9         | 27.7       | 108.7           | 4.36   | –                 |
| Pruned    | 84.4         | 74.7         | 10.4       | 41.2            | 2.32   | –                 |
| FL        | 83.8         | 70.9         | 27.7       | 108.6           | 4.36   | –                 |
| FL Pruned | 83.8         | 69.6         | 10.0       | 39.5            | 2.16   | ~64               |

Accuracy was preserved within 0.7 percentage points for single-model pruning, and there was no drop in accuracy for FL pruning; communication loads were reduced by approximately 64% per round.
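If per-round communication scales with the number of transmitted parameters, the ~64% figure is consistent with the table: $(27.7 - 10.0)/27.7 \approx 0.64$, i.e. roughly 64% fewer parameters exchanged per round. This is offered as a back-of-the-envelope reading, not necessarily how the paper defines the metric.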

6. Limitations and Prospects for Generalization

The current method computes skewness from a single warm-up batch, introducing the possibility of fluctuating estimates. Potential improvements include the use of running or layered statistics, adaptive per-layer thresholds (rather than fixed zero), and percentile-based selection criteria. The dependence on public server data for pruning imposes a constraint on full federated privacy; eliminating the need for even an open dataset remains an open challenge. Extension to larger Vision Transformers (ViTs), cross-modal networks (e.g., audio–text), and dynamic token pruning is plausible. Future directions also involve advanced FL aggregation strategies (e.g., FedProx), fusion layers for multimodal inputs, and further reductions via low-resolution training (Paxton et al., 9 Dec 2025).
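As one illustration of the percentile-based criterion mentioned above (not something the paper implements), components could be ranked by skewness and only a fixed fraction retained:

```python
import torch


def prune_below_percentile(scores: torch.Tensor, keep_fraction: float = 0.5) -> list:
    """Prune components whose skewness falls below the (1 - keep_fraction) quantile."""
    threshold = torch.quantile(scores, 1.0 - keep_fraction).item()
    return [i for i, s in enumerate(scores.tolist()) if s < threshold]
```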

7. Context and Significance

The skewness-guided pruning method directly addresses pressing challenges in scaling high-performance vision models for edge healthcare deployment by jointly optimizing compute, memory, and communication costs in a privacy-preserving federated infrastructure. Its design is motivated by the empirical observation that, in practice, a significant fraction of attention heads and MLP neuron groups in Swin Transformer architectures make little contribution to task-relevant representation, as diagnosed by the skewness of their activation distributions. This suggests utility for efficient model compression beyond the specific medical imaging use case, particularly where edge inference and privacy requirements are paramount.

The approach integrates seamlessly within the federated learning pipeline without violating data locality principles, leveraging only a small pool of open server data for structural adaptation. The demonstrated empirical results affirm strong compression–accuracy tradeoffs not only for single models but also within truly distributed, communication-bounded FL scenarios. A plausible implication is that skewness statistics may provide generally applicable saliency criteria for structural pruning in other deep network settings, especially those characterized by heterogeneous data and edge constraints (Paxton et al., 9 Dec 2025).
