Similarity & Transition-Based Pruning

Updated 4 December 2025
  • Similarity- and transition-based pruning are neural network compression methods that identify and remove redundant filters, tokens, or layers based on similarity metrics.
  • They leverage mathematical tools like cosine similarity, convolution-based alignment, and Centered Kernel Alignment to quantify cross-layer dependencies.
  • These techniques enable significant compression and acceleration in models such as CNNs and transformers with minimal impact on accuracy.

Similarity- and transition-based pruning refer to a family of neural network compression techniques that exploit redundancy in learned representations, filters, tokens, or entire layers based on quantifiable measures of similarity and cross-layer dependency. These methods are grounded in rigorous mathematical frameworks that use similarity metrics—such as cosine similarity, convolution-based alignment, or Centered Kernel Alignment (CKA)—to identify information overlap or marginal utility across model components, enabling the systematic removal of redundant structures while maximizing the preservation of functional capacity. Transition-based pruning emphasizes the retention of those elements that actively contribute to downstream network computation, measured via their impact or “utility” in subsequent layers. Modern approaches unify these perspectives, implementing pruning at the level of filters in convolutional networks, tokens in transformers, and whole-layer groups via mutual information proxies.

1. Mathematical Foundations of Similarity and Transition Metrics

Pruning decisions in this paradigm derive from explicit similarity computations. In convolutional neural networks (CNNs), Filter Similarity in Consecutive Layers (FSCL) quantifies the alignment between a filter in one layer and the corresponding input channel of downstream filters by performing a 3D convolution between the filter and a "lifted" version of the subsequent-layer weights, followed by an $L_1$-norm aggregation. The resulting score

$$\mathcal{L}(w^c_{j_0}) = \frac{1}{N^n} \sum_{j^n=1}^{N^n} \left\| w^c_{j_0,:,:,:} \otimes \hat{w}^n_{j^n,\,j_0,:,:} \right\|_1$$

identifies filters whose features are actually used by the next layer (Wang et al., 2023).
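
As a concrete illustration, the following is a minimal PyTorch sketch of this scoring step. It approximates the 3D convolution by correlating each input plane of a filter with the next-layer kernel slice that consumes that filter's output channel; the exact lifting in (Wang et al., 2023) may differ, and `fscl_scores` is a hypothetical helper rather than code from the paper.

```python
import torch
import torch.nn.functional as F

def fscl_scores(w_curr: torch.Tensor, w_next: torch.Tensor) -> torch.Tensor:
    """Approximate FSCL importance for each filter of the current layer.

    w_curr: current-layer weights, shape (N_c, C_in, k, k)
    w_next: next-layer weights,    shape (N_n, N_c, k2, k2)
    """
    n_curr, n_next = w_curr.shape[0], w_next.shape[0]
    scores = torch.zeros(n_curr)
    for j0 in range(n_curr):
        filt = w_curr[j0].unsqueeze(1)            # (C_in, 1, k, k)
        total = 0.0
        for jn in range(n_next):
            # kernel slice of next-layer filter jn that reads channel j0
            lifted = w_next[jn, j0][None, None]   # (1, 1, k2, k2)
            resp = F.conv2d(filt, lifted, padding=lifted.shape[-1] // 2)
            total += resp.abs().sum().item()      # L1 aggregation
        scores[j0] = total / n_next               # average over next-layer filters
    return scores
```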

In transformers, Similarity-Aware Token Pruning (SAINT) computes cosine similarity between the head-averaged key vectors of tokens, $S_{ij} = \frac{\bar{k}_i \cdot \bar{k}_j}{\|\bar{k}_i\|\,\|\bar{k}_j\|}$, and forms a bipartite graph to identify clusters of redundant tokens; the pruning rate is set adaptively via a batch-level voting mechanism based on these redundancy counts (Jeddi et al., 14 Mar 2025).
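
As an illustration, the sketch below computes $S_{ij}$ from head-averaged keys and derives per-token redundancy degrees. The similarity threshold is an assumed placeholder, and the full-graph degrees here stand in for SAINT's bipartite construction.

```python
import torch

def token_similarity(keys: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between head-averaged keys; keys: (heads, tokens, dim)."""
    k_bar = keys.mean(dim=0)                                   # (T, d)
    k_bar = k_bar / k_bar.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return k_bar @ k_bar.T                                     # S_ij, shape (T, T)

def redundancy_counts(sim: torch.Tensor, tau: float = 0.8) -> torch.Tensor:
    """Degree of each token in the redundancy graph (neighbors with S_ij > tau).

    tau = 0.8 is an illustrative threshold, not SAINT's published setting.
    """
    adj = (sim > tau).float()
    adj.fill_diagonal_(0.0)    # a token is not its own neighbor
    return adj.sum(dim=-1)
```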

For global, structured pruning (entire layers), MPruner computes pairwise layer similarities using Centered Kernel Alignment (CKA), $\mathrm{CKA}(X,Y) = \frac{\mathrm{HSIC}(K,L)}{\sqrt{\mathrm{HSIC}(K,K)\,\mathrm{HSIC}(L,L)}}$, where $K, L$ are Gram matrices of activations, providing a principled, global mutual-information proxy to identify and collapse highly redundant layer groups (Hu et al., 24 Aug 2024).
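
A minimal NumPy sketch of linear CKA, using the standard biased HSIC estimator over centered Gram matrices, is shown below; MPruner's exact estimator and batching may differ.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between activation matrices X (n, p1) and Y (n, p2)."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    K = H @ (X @ X.T) @ H                    # centered Gram matrix of X
    L = H @ (Y @ Y.T) @ H                    # centered Gram matrix of Y
    hsic_kl = np.sum(K * L)                  # tr(K L) for symmetric K, L
    return hsic_kl / np.sqrt(np.sum(K * K) * np.sum(L * L))
```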

2. Pruning Algorithms and Procedural Schemes

Each methodology operationalizes similarity and transition principles with process-specific pipelines.

FSCL (CNNs):

  1. Compute per-filter importance scores via cross-layer similarity.
  2. In each layer, rank and remove filters with lowest scores; remove corresponding input channels from subsequent layers.
  3. Optionally apply a conventional (e.g., $L_1$-norm) heuristic on the final layer.
  4. Fine-tune the pruned model with standard optimizer settings. The only hyperparameters are per-layer keep-ratios (Wang et al., 2023).
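
A minimal sketch of steps 1–2 for one pair of consecutive layers follows; the per-filter `scores` would come from an FSCL-style computation such as the one sketched in Section 1, and `prune_layer_pair` is a hypothetical helper.

```python
import torch

def prune_layer_pair(w_curr: torch.Tensor, w_next: torch.Tensor,
                     scores: torch.Tensor, keep_ratio: float):
    """Drop the lowest-scoring filters of the current layer and the
    matching input channels of the next layer (steps 1-2 above)."""
    n_keep = max(1, int(round(keep_ratio * w_curr.shape[0])))
    top = torch.argsort(scores, descending=True)[:n_keep]
    keep = torch.sort(top).values            # preserve original filter order
    return w_curr[keep], w_next[:, keep]
```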

SAINT (Transformers/ViTs/VLMs):

  1. At each transformer block, compute token–token similarity and build a bipartite redundancy graph.
  2. Use batch-level voting on token degrees (number of similar neighbors) to adaptively select a pruning rate $r$.
  3. Rank tokens by a centrality-style redundancy score, drop the most redundant fraction $r$ of tokens, and pass the remaining tokens forward.
  4. Pruning decisions are layerwise and dynamic, following a three-phase token evolution strategy (aligner–explorer–aggregator) (Jeddi et al., 14 Mar 2025).
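
A simplified sketch of steps 2–3 is given below. The vote threshold and rate cap are illustrative placeholders rather than SAINT's published settings, and `counts` are the per-token redundancy degrees from the Section 1 sketch.

```python
import torch

def adaptive_prune_rate(counts: torch.Tensor, vote_threshold: int = 2,
                        max_rate: float = 0.75) -> float:
    """Batch-level voting: the pruning rate r is the batch-averaged fraction
    of tokens whose redundancy degree exceeds vote_threshold, capped."""
    votes = (counts > vote_threshold).float().mean().item()
    return min(votes, max_rate)

def prune_tokens(x: torch.Tensor, counts: torch.Tensor, r: float) -> torch.Tensor:
    """Drop the most redundant fraction r of tokens; x: (batch, tokens, dim)."""
    b, t, d = x.shape
    n_keep = max(1, int(t * (1.0 - r)))
    keep = torch.argsort(counts, dim=-1)[:, :n_keep]   # least redundant first
    keep, _ = torch.sort(keep, dim=-1)                 # preserve token order
    return torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, d))
```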

MPruner (CNNs and Transformers):

  1. Use a small data seed to compute all pairwise CKA similarities among a set of layers.
  2. Form high-similarity clusters (CKA $\ge \tau$), then prune according to a granularity $k$ (e.g., retain one layer in each group), stopping when the accuracy drop exceeds a threshold $\gamma$.
  3. Fine-tune pruned regions (optionally freezing unpruned ones) to recover performance. The input hyperparameters are the accuracy-drop threshold $\gamma$ and the CKA threshold $\tau$ (Hu et al., 24 Aug 2024).
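
The clustering step can be sketched as below, assuming the pairwise CKA matrix has already been computed on the calibration seed and that clusters are formed over consecutive layers; MPruner's actual grouping and its handling of $k$ and $\gamma$ are more involved.

```python
import numpy as np

def cluster_layers(cka: np.ndarray, tau: float = 0.9):
    """Group consecutive layers whose CKA to the cluster's last member
    is at least tau; pruning then keeps one representative per group."""
    groups, current = [], [0]
    for i in range(1, cka.shape[0]):
        if cka[i, current[-1]] >= tau:
            current.append(i)      # still redundant: extend the cluster
        else:
            groups.append(current)
            current = [i]          # start a new cluster at layer i
    groups.append(current)
    return groups
```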

3. Theoretical Underpinnings and Justification

Similarity- and transition-based pruning advances beyond magnitude-based or single-layer heuristics by requiring that pruned parameters be genuinely inactive or redundant in the broader network context.

  • In FSCL, theoretical motivation is rooted in cross-layer utility: Only filters whose channel is substantively weighted and aligned in the subsequent layer are retained, ensuring that features surviving pruning materially contribute to the model’s predictive path (Wang et al., 2023).
  • SAINT’s methodology leverages token evolution dynamics; tokens that become highly similar as the transformer progresses (aligner/aggregator stages) are safe to prune, as removing them carries minimal semantic loss. The dynamic adaptation of pruning rates ensures preservation where representation diversity (explorer stages) peaks (Jeddi et al., 14 Mar 2025).
  • MPruner adopts a conservative, information-theoretic approach: CKA guarantees that deleted layers, by virtue of their near-identical representation to others, add at most a $(1-\tau)$ fraction of information. This mathematical framing directly bounds the information loss introduced by pruning (Hu et al., 24 Aug 2024).

4. Empirical Performance and Compression Efficiency

Experimental evaluation across FSCL, SAINT, and MPruner demonstrates superior parameter reduction and acceleration with minimal or no accuracy degradation.

| Approach | Model / Dataset | Compression | Accuracy Impact / Speedup |
|---|---|---|---|
| FSCL (Wang et al., 2023) | VGG-16, CIFAR-10 | 81.5% FLOPs, 89.5% params | Top-1 93.96% → 93.68% (↓0.28%) |
| FSCL (Wang et al., 2023) | ResNet-50, ImageNet | 55.99% FLOPs, 53.8% params | Top-1 76.15% → 75.84% (↓0.31%); Top-5 ↓0.08% |
| SAINT (Jeddi et al., 14 Mar 2025) | ViT-H/14, ImageNet-1K | ~2× throughput | Top-1 86.88% → 86.21% (↓0.67%); beats ToMe/PPT by 0.8% |
| SAINT (Jeddi et al., 14 Mar 2025) | LLaVA-13B, VLM benchmarks | 75% token drop, ~30% latency ↓ | <1% loss on MME/POPE/GQA; latency reduced to LLaVA-7B levels |
| MPruner (Hu et al., 24 Aug 2024) | BERT-base (Yahoo!, Dair AI) | 25–50% of encoder blocks pruned | Accuracy unchanged or ↑1%; 22–62% inference speedup; 50% memory ↓ |
| MPruner (Hu et al., 24 Aug 2024) | ResNet-152, ILSVRC | Up to 47% parameter removal | Top-1 ↓3.45% ($\tau = 99\%$); finer $k = 2$ yields 34% removal at ↓1.86% |

These results confirm that similarity- and transition-based strategies can drastically reduce network size and inference cost while maintaining or even improving performance under certain settings.

5. Structural and Application Scope

While core concepts are applicable to both CNNs and transformers, specific implementations exploit architectural details:

  • FSCL is tailored to structured filter pruning in CNNs, where inter-layer channel-filter dependencies are explicit and pruning can be localized without architectural modifications (Wang et al., 2023).
  • SAINT generalizes token pruning to vision transformers (ViTs) and vision-language models (VLMs), leveraging the triple-phase token evolution and adapting hyperparameters (similarity threshold $\tau$, neighbor count $K$) per application (Jeddi et al., 14 Mar 2025).
  • MPruner enables layer-wise collapse in both CNNs and transformers, subject to dimensional compatibility; transformers empirically exhibit higher collapse potential. MPruner also cleanly composes with unstructured pruning (e.g., magnitude/Wanda) (Hu et al., 24 Aug 2024).

Practical routines must address parameter shape compatibility, especially when removing or collapsing layers, and may tune the granularity $k$ to prevent mismatches.
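
As a toy illustration of this constraint, a contiguous group in a purely sequential network can be cut out only when its input and output widths match, so that its neighbors reconnect; the helper below is a hypothetical check, not MPruner code.

```python
def can_remove_group(group: list[tuple[int, int]]) -> bool:
    """group: (in_dim, out_dim) pairs for consecutive layers; removable only
    if the composite map preserves the feature dimension."""
    return group[0][0] == group[-1][1]
```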

6. Strengths, Limitations, and Future Directions

Similarity/transition-based pruning approaches are mathematically principled and empirically robust, with several notable strengths:

  • Guarantee on information loss tied to similarity threshold (CKA, cosine, etc.).
  • Flexibility across architectures (CNN, ViT, VLM, transformers).
  • Hyperparameterization via intuitive accuracy or information thresholds, rather than arbitrary sparsity ratios.
  • Effective composition with other, orthogonal pruning or quantization methods.

Limitations include computational overhead for global similarity calculation ($O(n \cdot |L|^2)$), challenges with very deep or large models (CKA overflow, attention dimension mismatches), and reduced gains in architectures with little internal redundancy (Hu et al., 24 Aug 2024). Future work is expected to refine similarity metrics (e.g., nonlinear CKA), extend these approaches to LLMs, and integrate with Neural Architecture Search frameworks.

7. Significance within Model Compression Research

Similarity- and transition-based pruning represent a foundational advance within structured compression methodologies, bridging local and global information-theoretic perspectives. By directly measuring and acting on cross-component redundancy and propagation, these methods systematically balance compression with accuracy, and set a standard for interpretability and controllability in compression workflows. Their integration into diverse architectures and widespread empirical validation underscores their prominence and utility in efficient inference applications across domains (Wang et al., 2023, Jeddi et al., 14 Mar 2025, Hu et al., 24 Aug 2024).
