
Dual Taylor Expansion in Model Pruning

Updated 18 January 2026
  • Dual Taylor Expansion is a method for precise error estimation in neural network pruning that jointly considers weight and activation perturbations.
  • It employs a second-order expansion to compute error metrics, guiding effective mask selection and weight updates in the D^2Prune algorithm.
  • Empirical results demonstrate that this approach outperforms traditional magnitude-based and single-gradient methods, especially at high sparsity levels.

Dual Taylor Expansion is a formalism for precise error estimation in neural network pruning that simultaneously models the effect of weight and activation perturbations. Recently introduced as a core component of D^2Prune for compressing LLMs, the Dual Taylor Expansion approach enables more accurate and stable identification of pruning masks and weight updates than prior single-variable or data-agnostic methods (Xiong et al., 14 Jan 2026).

1. Motivation and Context

Conventional pruning methods estimate pruning error by only considering weight perturbations (e.g., via a Taylor expansion with respect to weights alone) or by ignoring data-distributional shifts between calibration and test data. Additionally, such approaches neglect changes in the activation statistics arising from weight sparsification. In high-sparsity regimes, especially for modules with non-Gaussian or long-tailed activation distributions (as in multi-head attention), this leads to inaccurate error estimation and severe performance drops. The Dual Taylor Expansion method addresses these limitations by jointly capturing the contributions from both weight and input perturbations, thus supporting robust mask selection and error minimization during pruning (Xiong et al., 14 Jan 2026).

2. Mathematical Formulation

Let $f(w, x)$ denote the output of a module (layer) given weights $w$ and input $x$. D^2Prune uses a second-order expansion (Editor’s term: "Dual Taylor") with respect to both $w$ and $x$, yielding

$$f(w+\delta w,\ x+\delta x) \approx f(w, x) + \nabla_w f(w, x) \cdot \delta w + \nabla_x f(w, x) \cdot \delta x + \frac{1}{2}\, \delta w^T \nabla^2_{ww} f(w, x)\, \delta w + \frac{1}{2}\, \delta x^T \nabla^2_{xx} f(w, x)\, \delta x + \cdots$$

Here, $\delta w$ and $\delta x$ denote the weight perturbation induced by pruning and the associated shift in the module's input, respectively. The mixed term in $\delta w$ and $\delta x$ and all higher-order terms are absorbed into the ellipsis and treated as negligible under mild assumptions. The error caused by pruning a set of weights is then estimated not only from their gradients but also from how the pruned weights interact with typical input variations (Xiong et al., 14 Jan 2026).
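To make the expansion concrete, the following is a minimal numerical sketch on a toy scalar module $f(w, x) = \tanh(w \cdot x)$. The module, the vectors, and the function name `taylor_estimates` are illustrative assumptions and are not taken from D^2Prune. The sketch compares a weight-only first-order estimate, the dual second-order estimate written above, and the same estimate with the mixed term added, against the exact perturbed output.

```python
import numpy as np

def taylor_estimates(w, x, dw, dx):
    """Compare Taylor estimates of f(w+dw, x+dx) for the toy module f = tanh(w . x).

    Toy example only: the module and all names here are assumptions for illustration.
    """
    s = float(w @ x)
    t = np.tanh(s)
    d1 = 1.0 - t**2          # tanh'(s)
    d2 = -2.0 * t * d1       # tanh''(s)
    a = float(x @ dw)        # grad_w f . dw = d1 * a
    b = float(w @ dx)        # grad_x f . dx = d1 * b

    # Weight-only, first order: ignores the input perturbation entirely.
    weight_only = t + d1 * a
    # Dual second order as in the displayed expansion (ww and xx blocks only).
    dual_second = t + d1 * (a + b) + 0.5 * d2 * (a**2 + b**2)
    # Same, plus the mixed term dw^T (grad^2_wx f) dx.
    dual_with_mixed = dual_second + d2 * a * b + d1 * float(dw @ dx)

    exact = float(np.tanh((w + dw) @ (x + dx)))
    return {
        "exact": exact,
        "weight_only": weight_only,
        "dual_second": dual_second,
        "dual_with_mixed": dual_with_mixed,
    }

w = np.array([0.5, -0.3, 0.8])
x = np.array([1.0, 0.2, -0.4])
dw = np.array([0.0, 0.3, 0.0])        # pruning w[1] means dw[1] = -w[1]
dx = np.array([0.02, -0.01, 0.03])    # small assumed shift in the input

est = taylor_estimates(w, x, dw, dx)
errors = {k: abs(v - est["exact"]) for k, v in est.items() if k != "exact"}
```

In this toy setting the absolute error shrinks from the weight-only estimate to the dual estimate, and again when the mixed term is included; how large the mixed term is in a real layer depends on the assumptions made in the source.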

In practice, the estimation can be simplified as follows. For a linear (or locally linear) module,

$$\mathbb{E}_x \|A x - (M \odot A) x\|_2^2 \approx \mathrm{Tr}\left[ \big((1 - M) \odot A\big)\, \Sigma_x\, \big((1 - M) \odot A\big)^T \right]$$

where $M$ is the binary pruning mask (1 for kept weights, so $1 - M$ selects the pruned ones), $A$ the full parameter matrix, and $\Sigma_x$ the empirical input covariance (second-moment matrix) over the calibration set. This dual expansion ensures that mask selection penalizes masks that, under the data distribution, induce large output perturbations, and not just those with large weight magnitudes.
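The linear-module estimate can be checked numerically. The sketch below reads the mask as acting elementwise, so the residual weights are $(1 - M) \odot A$, and takes $\Sigma_x$ to be the uncentered second moment of the calibration inputs; both readings, the random data, and the helper name `masked_output_error` are assumptions for illustration. Under those assumptions the trace expression equals the mean squared output error over the calibration set.

```python
import numpy as np

def masked_output_error(A, M, sigma_x):
    """Trace estimate of the expected squared output error of pruning A with mask M.

    Assumes an elementwise 0/1 mask and sigma_x = E[x x^T] (uncentered).
    """
    R = (1.0 - M) * A                      # weights removed by the mask
    return float(np.trace(R @ sigma_x @ R.T))

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 6))          # calibration inputs, one per row
A = rng.standard_normal((4, 6))            # dense weights (out_features x in_features)
M = (np.abs(A) > 0.5).astype(float)        # toy magnitude mask, for illustration only

sigma_x = X.T @ X / X.shape[0]             # empirical second moment
predicted = masked_output_error(A, M, sigma_x)

# Direct measurement: mean over calibration rows of ||A x - (M * A) x||^2.
empirical = float(np.mean(np.sum((X @ A.T - X @ (M * A).T) ** 2, axis=1)))
```

Here `predicted` and `empirical` agree up to floating-point error, because the trace is just the averaged squared norm of the residual output written in terms of the input statistics.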

3. Algorithmic Role in D^2Prune

Dual Taylor Expansion underpins several stages in D^2Prune (Xiong et al., 14 Jan 2026):

  • Pruning Mask Selection:

The method computes, for every weight candidate, an error metric aggregating both parameter and input sensitivities, guiding iterative or block-wise mask selection.

  • Weight Update After Pruning:

After masking, D^2Prune leverages the Dual Taylor expansion to compute layerwise or blockwise weight updates that best compensate for the structured sparsity, again minimizing expected output error given realistic input distributions.

Attention projections ($q, k, v$) are pruned not simply on unstructured weight-magnitude or single-sided sensitivity, but with full awareness of the joint effect of weight and activation variability, crucial in preserving the long-tailed, high-selectivity nature of multi-head attention.

This approach is compatible with blockwise structured pruning and can be combined with outlier detection and dynamic update schemes for additional error control.
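The two stages above, mask selection and post-pruning weight update, can be sketched for a single linear layer. This is not the D^2Prune error metric or update rule, which are not spelled out here. As stand-ins, the sketch uses a hypothetical per-weight score $A_{ij}^2\,(\Sigma_x)_{jj}$ (the expected output error of removing that one weight alone) and a per-row least-squares compensation of the kept weights under the calibration second moment. The names `select_mask`, `compensate`, `output_error`, and the `damp` parameter are all assumptions.

```python
import numpy as np

def select_mask(A, sigma_x, sparsity):
    """Hypothetical score: error of removing each weight alone, A_ij^2 * Sigma_jj."""
    scores = A**2 * np.diag(sigma_x)[None, :]
    k = int(round(sparsity * A.size))
    M = np.ones_like(A)
    if k > 0:
        M.flat[np.argsort(scores, axis=None)[:k]] = 0.0   # drop lowest scores
    return M

def compensate(A, M, sigma_x, damp=1e-6):
    """Per row, refit kept weights to minimise expected output error.

    Closed form: w_keep = a_keep + Sigma_kk^{-1} Sigma_kd a_drop (lightly damped).
    """
    W = M * A
    for i in range(A.shape[0]):
        keep = M[i] > 0
        drop = ~keep
        if not keep.any() or not drop.any():
            continue                                       # nothing to refit
        s_kk = sigma_x[np.ix_(keep, keep)] + damp * np.eye(int(keep.sum()))
        s_kd = sigma_x[np.ix_(keep, drop)]
        W[i, keep] = A[i, keep] + np.linalg.solve(s_kk, s_kd @ A[i, drop])
    return W

def output_error(A, W, sigma_x):
    """Expected squared output error between dense A and replacement W."""
    R = A - W
    return float(np.trace(R @ sigma_x @ R.T))

rng = np.random.default_rng(0)
mix = rng.standard_normal((8, 8))
X = rng.standard_normal((512, 8)) @ mix    # correlated calibration inputs (rows)
A = rng.standard_normal((4, 8))
sigma_x = X.T @ X / X.shape[0]

M = select_mask(A, sigma_x, sparsity=0.5)
err_masked = output_error(A, M * A, sigma_x)
err_compensated = output_error(A, compensate(A, M, sigma_x), sigma_x)
```

Because leaving the kept weights unchanged is one feasible choice in the least-squares problem, `err_compensated` cannot exceed `err_masked`; the gap grows when input features are correlated, which is why the toy inputs are mixed.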

4. Empirical Results and Comparative Evaluation

Experiments with D^2Prune demonstrate that Dual Taylor–guided pruning outperforms both purely magnitude-based (Wanda) and single-gradient-based methods (SparseGPT) in preserving accuracy at extreme sparsity levels (e.g., ≥80%) (Xiong et al., 14 Jan 2026). Notably:

  • On WikiText2 at 80% sparsity, Dual Taylor methods reduced perplexity to 92.68 (dynamic update) versus 101.88 (SparseGPT) and 5107.20 (Wanda).
  • Layerwise error metrics attributable to pruning are more closely matched to observed test-set degradation with Dual Taylor than with single-variable Taylor formulas.
  • The method yields the lowest Kullback–Leibler divergence between dense (reference) and pruned attention distributions, preserving the essential long-tail structure necessary for factual and reasoning tasks.

These outcomes demonstrate that accounting for activation-conditional effects is critical in high-sparsity model compression.

5. Relationship to Attention Distribution and Model Robustness

Dual Taylor Expansion is especially impactful for attention modules in LLMs, where activation (input) statistics exhibit heavy-tailed and context-sensitive behavior. When pruning $q, k, v$ projections, naive approaches often flatten the attention distribution, destroying the selective, high-mass assignments crucial for reasoning and memorization. Dual Taylor–based pruning sustains attention sparsity and selectivity, maintaining low KL divergence to pre-pruned patterns, and thus supports near-lossless compression (Xiong et al., 14 Jan 2026).
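The KL-divergence diagnostic mentioned above can be sketched as follows. This is a single-head, unmasked toy attention with random weights and a simple magnitude pruner standing in for any pruning method; the helper names (`attention_probs`, `mean_kl`, `magnitude_prune`) and the `eps` smoothing constant are assumptions, and the numbers it produces say nothing about the reported results.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_probs(X, Wq, Wk):
    """Row-stochastic attention matrix for one head (no causal mask)."""
    Q = X @ Wq.T
    K = X @ Wk.T
    return softmax(Q @ K.T / np.sqrt(Wq.shape[0]))

def mean_kl(P, Q, eps=1e-12):
    """Mean over query positions of KL(P_row || Q_row)."""
    return float(np.mean(np.sum(P * (np.log(P + eps) - np.log(Q + eps)), axis=-1)))

def magnitude_prune(W):
    """Toy stand-in pruner: zero entries below the median magnitude."""
    return np.where(np.abs(W) >= np.median(np.abs(W)), W, 0.0)

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 16))              # 10 tokens, model width 16
Wq = rng.standard_normal((8, 16))
Wk = rng.standard_normal((8, 16))

dense = attention_probs(X, Wq, Wk)
pruned = attention_probs(X, magnitude_prune(Wq), magnitude_prune(Wk))

kl_same = mean_kl(dense, dense)                # reference point: identical maps
kl_pruned = mean_kl(dense, pruned)             # drift caused by pruning q, k
```

A lower `kl_pruned` indicates that the pruned projections reproduce the dense attention pattern more closely; comparing this value across pruning methods is one way to quantify the flattening effect described above.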

6. Extensions and Broader Implications

While introduced for LLM compression, the Dual Taylor Expansion method is applicable to vision transformer pruning (e.g., DeiT), delivering high accuracy on datasets such as ImageNet-1K with substantial parameter and compute reduction (Xiong et al., 14 Jan 2026). The principle further invites extension to other structured, data-dependent model operations where simultaneous sensitivity to weights and activations modulates robustness.

7. Limitations and Future Research

Current Dual Taylor implementations require activation statistics estimation on representative calibration sets. Their accuracy depends on the stationarity of such distributions between calibration and deployment. Further research may explore higher-order corrections, automatic adaptation to nonstationary activation regimes, and integration with explicit attention distribution regularization.
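The dependence on stationarity can be illustrated with a small check, under the assumption that the pruning error of a linear layer is estimated through the input second moment. The sketch compares the error predicted from calibration statistics with the error under a shifted deployment distribution; the uniform rescaling used as the shift, and the names `second_moment`, `pruning_error`, and `relative_gap`, are hypothetical choices for illustration.

```python
import numpy as np

def second_moment(X):
    """Uncentered second moment of inputs stored one per row."""
    return X.T @ X / X.shape[0]

def pruning_error(R, sigma):
    """Expected squared output error for residual (pruned-away) weights R."""
    return float(np.trace(R @ sigma @ R.T))

def relative_gap(R, X_calib, X_deploy):
    """Relative mismatch between calibration-based and deployment error."""
    predicted = pruning_error(R, second_moment(X_calib))
    actual = pruning_error(R, second_moment(X_deploy))
    return abs(predicted - actual) / actual

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))
# Residual of a toy magnitude pruner: the weights that were removed.
R = np.where(np.abs(A) <= np.median(np.abs(A)), A, 0.0)

X_calib = rng.standard_normal((512, 6))
X_deploy = 2.0 * X_calib                      # assumed shift: activations doubled

gap_same = relative_gap(R, X_calib, X_calib)  # no shift, estimate is exact
gap_shift = relative_gap(R, X_calib, X_deploy)
```

With activations doubled, the true error is four times the calibration estimate, so `gap_shift` is 0.75; a diagnostic of this kind could flag when calibration statistics no longer describe deployment inputs.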


In summary, Dual Taylor Expansion provides a theoretically principled and empirically superior framework for pruning neural networks under realistic data distributions, especially for architectures with complex, nonuniform activation patterns such as transformers with multi-head attention (Xiong et al., 14 Jan 2026).

References (1)
