InfCoP: Information Consistent Pruning

Updated 15 December 2025
  • Information Consistent Pruning (InfCoP) is a method for neural network compression that preserves task-relevant information using principles like mutual information, effective parameter count, and flow metrics.
  • InfCoP includes techniques such as iterative pruning with flow-based early stopping, initialization-aware sparsity, and attention head pruning that directly optimize for minimal information loss.
  • Empirical studies show that InfCoP reduces computational overhead and maintains high accuracy on tasks like vision–language modeling, with strong theoretical guarantees on robustness and generalization.

Information Consistent Pruning (InfCoP) is a set of principles and methodologies for neural network compression that rigorously preserve information-theoretic properties during pruning. InfCoP exploits measures such as mutual information, effective parameter count, and information flow to provide structural sparsity guarantees without sacrificing the task-relevant information content of deep models. Methods under the InfCoP banner include algorithms for iterative pruning, early stopping, initialization-aware sparsity, and information-theoretic objectives grounded in the Information Bottleneck principle. InfCoP is realized both as a meta-theoretic constraint on the design of pruning schemes and as concrete procedures, such as InfoPrune for vision–LLMs, layerwise-activation mutual information preservation, flow-constrained iterative pruning, and initialization-aware mask selection. The following sections detail the conceptual foundation, theoretical guarantees, algorithms, empirical findings, and comparative positioning of Information Consistent Pruning.

1. Information-Theoretic Motivation and the Information Bottleneck Principle

InfCoP arises from the need to compress deep neural networks while providing principled guarantees on information retention. The theoretical basis is the Information Bottleneck (IB) principle, which frames an intermediate representation $Z$ as a tradeoff:

$$\min_Z \; \mathcal{L}_{IB} = I(Z;X) - \beta\, I(Z;Y)$$

where $X$ denotes the input (e.g., pixels or tokens), $Y$ is the task output (e.g., a classification label or VQA answer), and $\beta > 0$ balances compression against predictive utility. Pruning in the InfCoP framework is posed as the selection of a substructure $Z_S \subset Z$ so that $I(Z_S;X)$ is minimized (to discard redundancies) while $I(Z_S;Y)$ is kept high (to preserve task-relevant semantics) (Xu et al., 24 Nov 2025).
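
The following minimal numpy sketch (an illustration, not code from the cited work) shows how the task-relevant term $I(Z_S;Y)$ of this tradeoff can be roughly estimated for a candidate substructure by quantizing its activations and applying a plug-in discrete estimator; the function names and the quantization scheme are assumptions for exposition only.

```python
import numpy as np

def discrete_mutual_information(a, b):
    """Plug-in MI estimate (in nats) between two discrete 1-D integer arrays."""
    joint = np.zeros((int(a.max()) + 1, int(b.max()) + 1))
    for x, y in zip(a, b):
        joint[x, y] += 1
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)   # marginal of a
    pb = joint.sum(axis=0, keepdims=True)   # marginal of b
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

def ib_score(z_sub, labels, n_bins=16):
    """Rough estimate of I(Z_S; Y) for a candidate pruned substructure.

    z_sub:  (n_samples, n_kept_units) activations of the substructure Z_S.
    labels: (n_samples,) integer task labels Y.
    The activations are collapsed to 1-D and quantized, so this is only a
    coarse proxy; higher values mean more task-relevant information retained.
    """
    proj = z_sub.mean(axis=1)                                   # collapse units
    edges = np.quantile(proj, np.linspace(0, 1, n_bins + 1)[1:-1])
    z_disc = np.digitize(proj, edges)                           # quantize to bins
    return discrete_mutual_information(z_disc, labels.astype(int))
```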

Masks $M$ and their data-dependence play a pivotal role: in modern theory, the capacity of a sparse network is not just a function of the count of active (nonzero) parameters, but also of the mutual information $I(M;D)$ between the mask and the dataset $D$ (Kumar et al., 2 Feb 2024). Thus, an "information-consistent" pruning procedure must explicitly manage both the number of unpruned weights and the data-dependent information encoded by the pruning mask.

2. Key Quantities: Effective Parameters, Entropy-based Rank, and Information Flow

The effectiveness and limits of pruning are quantified with several tightly linked measures:

  • Effective Parameter Count: Defined as $p_{\mathrm{eff}} := \|w\|_0 + I(M;D)$, where $w$ is the parameter vector, $M$ is the binary mask, and $I(M;D)$ is the mutual information between the mask and the dataset. This concept is central to the "Sparse Law of Robustness," which governs generalization in pruned models and reveals that high sparsity can only be achieved if the mask is sufficiently data-dependent (Kumar et al., 2 Feb 2024).
  • Entropy-based Effective Rank (eRank): For a matrix $A$ with singular values $\{\sigma_i\}$, define the normalized spectrum $p_i = \sigma_i / \sum_{j=1}^r \sigma_j$ and entropy $H(\sigma) = -\sum_i p_i \log p_i$. The effective rank is $\mathrm{eRank}(A) = \exp\bigl(H(\sigma)\bigr)$. In InfoPrune, minimizing $\mathrm{eRank}(Z_S) - \mathrm{eRank}(Z)$ gives a direct, differentiable handle on information redundancy (Xu et al., 24 Nov 2025); see the sketch after this list.
  • Kolmogorov–Smirnov (KS) Distance: The maximal discrepancy between the CDFs of the original and pruned singular value spectra, $D_{KS} = \sup_x |F_{\mathrm{orig}}(x) - F_{\mathrm{pruned}}(x)|$. This monitors information loss due to spectrum distortion, and minimizing $D_{KS}$ aligns with minimizing the loss in mutual information $I(Y;Z_S)$ (Xu et al., 24 Nov 2025).
  • Information Flow and Gradient Flow: Information flow (IF) between layers is captured by matrices of absolute pairwise correlations of the activations across adjacent layers. Gradient flow (GF) is given by the layerwise norm of backpropagated gradients. The proximity between pruned and dense networks can be controlled via the $L_2$ distance between these flows, with theoretical guarantees bounding the accuracy drop by the flow discrepancy (Gharatappeh et al., 26 Jan 2025).
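
As a concrete reference for the spectrum-based quantities above, the following numpy sketch (illustrative names, not the released implementations) computes eRank, the KS distance between singular value spectra, and the effective parameter count given an externally estimated $I(M;D)$.

```python
import numpy as np

def effective_rank(A):
    """Entropy-based effective rank: eRank(A) = exp(H(sigma))."""
    s = np.linalg.svd(A, compute_uv=False)
    s = s[s > 0]
    p = s / s.sum()                                 # normalized singular spectrum
    return float(np.exp(-(p * np.log(p)).sum()))

def ks_distance(A_orig, A_pruned):
    """Kolmogorov-Smirnov distance between the two singular value CDFs."""
    s1 = np.sort(np.linalg.svd(A_orig, compute_uv=False))
    s2 = np.sort(np.linalg.svd(A_pruned, compute_uv=False))
    grid = np.union1d(s1, s2)
    cdf1 = np.searchsorted(s1, grid, side="right") / s1.size
    cdf2 = np.searchsorted(s2, grid, side="right") / s2.size
    return float(np.abs(cdf1 - cdf2).max())

def effective_param_count(weights, mask, mask_data_mi):
    """p_eff = ||M . w||_0 + I(M;D); the MI term must be estimated separately."""
    return float(np.count_nonzero(weights * mask)) + mask_data_mi
```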

3. InfCoP Algorithms and Compression Schemes

Several algorithm classes instantiate the InfCoP principle:

A. InfoPrune: Structural Pruning with Unified Information Objective

InfoPrune realizes InfCoP in vision–LLMs by optimizing a composite loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \alpha \|\zeta\|_1 + \beta \bigl[\mathrm{eRank}(Z_S) - \mathrm{eRank}(Z)\bigr] + \gamma\, D_{KS}$$

where $\zeta_{l,h}$ are learnable importance gates for each attention head, $\|\zeta\|_1$ enforces sparsity, and $(\alpha, \beta, \gamma)$ control the balance among sparsity, redundancy removal, and information loss. Pruning proceeds via:

  • Attention Head Pruning: Learnable gates $\zeta_{l,h}$ are optimized by backpropagation; after training, heads with $\mathrm{sigmoid}(\zeta_{l,h}) < z$ (e.g., $z = 0.5$) are pruned.
  • FFN Low-rank Compression: A training-free SVD-based scheme prunes directions in the FFN according to an error budget $\epsilon$, retaining only the minimal $k$ singular directions needed to preserve $(1-\epsilon^2)$ of the spectral energy; the block is reparameterized accordingly (Xu et al., 24 Nov 2025). A code sketch of both steps follows this list.
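
A hedged PyTorch-style sketch of the two mechanisms, with illustrative names and defaults (the threshold $z$ and budget $\epsilon$ are placeholders; this is not the authors' released code): sigmoid-gated head masking with its $\ell_1$ penalty, and energy-budgeted SVD truncation of an FFN weight matrix.

```python
import torch

def head_mask_and_penalty(zeta, threshold=0.5):
    """Soft head gates during training and the hard keep-mask used afterwards.

    zeta: (n_layers, n_heads) learnable logits; sigmoid(zeta) gates each head's
    output. Returns the soft gates, the L1 penalty ||zeta||_1, and the binary
    keep-mask (heads with sigmoid(zeta) < threshold are pruned).
    """
    gates = torch.sigmoid(zeta)
    l1_penalty = zeta.abs().sum()
    keep = gates >= threshold
    return gates, l1_penalty, keep

def svd_compress_ffn(W, eps=0.1):
    """Training-free low-rank factorization keeping (1 - eps^2) of spectral energy.

    W: (d_out, d_in) FFN weight. Returns factors (A, B) with W ~ A @ B, where k
    is the smallest rank whose cumulative squared-singular-value energy reaches
    the budget.
    """
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    energy = torch.cumsum(S**2, dim=0) / (S**2).sum()
    target = torch.tensor(1.0 - eps**2, dtype=energy.dtype)
    k = int(torch.searchsorted(energy, target).item()) + 1
    A = U[:, :k] * S[:k]            # (d_out, k): U_k scaled by the kept spectrum
    B = Vh[:k, :]                   # (k, d_in)
    return A, B
```

In this sketch the composite objective above would simply sum the task loss, $\alpha$ times the returned gate penalty, $\beta$ times the eRank gap, and $\gamma$ times $D_{KS}$ computed on the retained activations.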

B. Flow-based Early Stopping in Iterative Magnitude Pruning

Instead of retraining to match accuracy after each pruning step (as in standard IMP), InfCoP, as instantiated in (Gharatappeh et al., 26 Jan 2025), monitors IF or GF and halts retraining when the flow distance to the original network falls below a threshold $\epsilon$. Pruning is carried out using magnitude or PQ-index criteria, but the distinctive feature is the flow-proximity stopping rule:

  • At each iteration, retrain only until $\|\Phi(w_e) - \Phi^*\|_2 \leq \epsilon$, not until accuracy plateaus, substantially reducing retraining time (see the sketch below).
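
A minimal sketch of this rule, assuming simple callables for one retraining epoch and for collecting layer activations (these interfaces are assumptions, not the paper's code): the dense network's flow $\Phi^*$ is cached once, and retraining after each pruning step stops as soon as the current flow is within $\epsilon$.

```python
import numpy as np

def information_flow(layer_activations):
    """Information flow Phi: mean absolute correlation between adjacent layers.

    layer_activations: list of (n_samples, n_units_l) arrays, one per layer.
    Returns a 1-D vector with one entry per adjacent layer pair.
    """
    flow = []
    for A, B in zip(layer_activations[:-1], layer_activations[1:]):
        C = np.corrcoef(A.T, B.T)                    # (units_A + units_B) square
        cross = C[:A.shape[1], A.shape[1]:]          # correlations across the pair
        flow.append(np.abs(cross).mean())
    return np.array(flow)

def retrain_until_flow_matches(train_one_epoch, get_activations,
                               phi_star, eps, max_epochs=50):
    """Retrain only until ||Phi(w_e) - Phi*||_2 <= eps, not until accuracy plateaus."""
    for epoch in range(max_epochs):
        train_one_epoch()                            # one retraining epoch
        phi = information_flow(get_activations())
        if np.linalg.norm(phi - phi_star) <= eps:
            return epoch + 1                         # epochs actually spent
    return max_epochs
```

Here $\Phi^*$ would be computed once on the dense network with the same information_flow function (or a layerwise gradient-norm analogue for GF).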

C. Information-Consistent Pruning at Initialization

InfCoP variants for pruning at or near initialization use the mutual information between the mask and the data to ensure that the effective parameter count remains above the threshold for robust generalization. The practical algorithm repeatedly estimates $I(M;D)$ (using MINE, kNN-MI, or noise-correlation proxies) and prunes weights according to magnitude or other criteria, but only as far as permitted by $p_{\mathrm{eff}} \geq p_{\min}$ (Kumar et al., 2 Feb 2024).
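
A schematic sketch of this constraint, assuming a user-supplied estimator of $I(M;D)$ (standing in for MINE, kNN-MI, or a noise-correlation proxy); the step size and names are illustrative rather than prescribed.

```python
import numpy as np

def prune_with_effective_param_floor(weights, estimate_mask_data_mi,
                                     p_min, step=0.05):
    """Magnitude-prune in small sparsity steps, never letting p_eff drop below p_min.

    weights: flat parameter vector of the (possibly untrained) network.
    estimate_mask_data_mi: callable mask -> estimate of I(M; D), e.g. via MINE,
        kNN-MI, or a noise-correlation proxy (supplied by the user).
    Returns the final binary keep-mask.
    """
    mask = np.ones_like(weights, dtype=bool)
    sparsity = 0.0
    while sparsity + step < 1.0:
        candidate_sparsity = sparsity + step
        thresh = np.quantile(np.abs(weights), candidate_sparsity)
        candidate = np.abs(weights) > thresh                 # magnitude criterion
        p_eff = candidate.sum() + estimate_mask_data_mi(candidate)
        if p_eff < p_min:                                    # would violate the floor
            break
        mask, sparsity = candidate, candidate_sparsity
    return mask
```

Here p_min corresponds to the effective-parameter threshold implied by the robustness bound discussed in Section 4.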

| InfCoP Variant | Core Principle | Distinctive Feature |
| --- | --- | --- |
| InfoPrune | Information Bottleneck + eRank/KS | Spectral objectives, VLMs |
| Flow-based IMP (InfCoP) | IF/GF proximity | Early stopping per flow |
| Initialization InfCoP | Mutual information in mask/data | Effective parameter constraint |

4. Theoretical Guarantees and Law of Robustness Extensions

InfCoP is underpinned by formal generalization bounds. The key result is the extension of the classical Law of Robustness: for a pruned neural network with mask $M$, weights $w$, and dataset $D$, the effective parameter count $p_{\mathrm{eff}} = \|M \odot w\|_0 + I(M;D)$ determines the achievable error and robustness. Specifically,

$$\mathrm{Lip}(f^W) \geq \Omega\bigl(\epsilon \sqrt{n d / p_{\mathrm{eff}}}\bigr)$$

for interpolators fitting below the noise level, with $n$ samples, $d$ the input dimension, and $\epsilon$ the noise margin. This reveals that more aggressive pruning (higher sparsity) necessarily comes coupled with larger $I(M;D)$; data-agnostic pruning is fundamentally limited in its achievable sparsity unless $n$ is extremely large (Kumar et al., 2 Feb 2024).
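
Rearranging the bound makes this coupling explicit (a direct algebraic consequence of the inequality above, not an additional result from the cited work): for a pruned interpolator whose Lipschitz constant is to stay at or below a target $L$, with $c$ absorbing the constant hidden in $\Omega(\cdot)$,

$$\|M \odot w\|_0 \;\geq\; c\,\frac{\epsilon^2 n d}{L^2} \;-\; I(M;D),$$

so every active weight removed beyond this floor must be compensated by additional mask–data information $I(M;D)$.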

A deduced practical implication is that maintaining $p_{\mathrm{eff}}$ above the required threshold during pruning—whether at initialization or over training—is necessary to avoid catastrophic loss of robustness and downstream accuracy. This information-theoretic barrier connects empirical properties such as lottery ticket existence directly to the properties of the pruning mask and flow metrics (Kumar et al., 2 Feb 2024, Gharatappeh et al., 26 Jan 2025).

5. Empirical Findings and Experimental Evaluation

InfCoP algorithms have demonstrated strong empirical performance and computational advantages.

  • InfoPrune results on Qwen2VL-7B: Up to 3.2× reduction in FLOPs and 1.8× latency improvement on multimodal benchmarks (VQAv2, TextVQA, GQA) with negligible loss in accuracy (e.g., TextVQA drops <1%), outperforming heuristic and random baselines, as well as prior art YOPO and PruneNet (Xu et al., 24 Nov 2025).
  • Iterative Pruning with Flow-based Early Stopping: On ResNet-18 (CIFAR-10), InfCoP-IF and InfCoP-GF achieve a sparsity–accuracy tradeoff identical to SAP's but require only 58–62 retraining epochs (vs. 300 for SAP/LTH), delivering a >80% reduction in computation cost (Gharatappeh et al., 26 Jan 2025).
  • Initialization-aware InfCoP: Matches the final sparsities found by standard iterative pruning, but attains this in a single pass by balancing sparsity and mask information content (Kumar et al., 2 Feb 2024).
  • Ablation on eRank and KS: Use of both terms stabilizes pruning and preserves 1–2% more accuracy than KS-only metrics at comparable compression (Xu et al., 24 Nov 2025).

6. Comparisons and Relation to Other Information-Preserving Pruning Methods

InfCoP distinguishes itself from magnitude-based and heuristic pruning by providing formal criteria for how much information about the data must be present in the pruned architecture, offering actionable stopping criteria and mask selection thresholds. In contrast, Mutual Information Preserving Pruning (MIPP) (Westphal et al., 31 Oct 2024) enforces exact preservation of MI between adjacent layers, guaranteeing the existence of a deterministic mapping required for re-trainability. While both InfCoP and MIPP are founded on information-theoretic measures, InfCoP emphasizes effective parameter counting, flow-based metrics, and explicit IB optimization, whereas MIPP operationalizes per-layer MI constraints to prevent layer collapse and maximize sparsity.

Empirically, InfCoP-based methods outperform classical iterative magnitude pruning (IMP), Smooth Adaptive Pruning (SAP), and spectrum-based methods, especially in sample efficiency and total retraining epochs required. Both InfCoP and MIPP demonstrate robustness to high compression and avoidance of catastrophic layer failure.

7. Practical Considerations, Limitations, and Future Extensions

Tuning flow thresholds ($\epsilon$) and balancing objective weights are crucial for an optimal trade-off between efficiency and performance. The computation of IF/GF or MI can become a bottleneck in extremely large models; subsampling or a reduced cadence of flow computation can mitigate the overhead. InfCoP is highly effective for structured pruning of attention-based models and CNNs but may require adaptation for architectures with disparate information propagation patterns.

Proposed extensions include application to language modules in multimodal systems, synergy with quantization and distillation under the IB lens, and generalization to non-attention-based architectures such as video and speech transformers (Xu et al., 24 Nov 2025, Gharatappeh et al., 26 Jan 2025). Joint optimization of pruning rates and flow thresholds, and meta-learned cross-dataset flow consistency, are pertinent directions for broadening the practical scope of InfCoP frameworks.
