
C-SAM: Compression-aware Sharpness Minimization

Updated 4 February 2026
  • The paper introduces C-SAM, extending traditional SAM by addressing discrete perturbations from quantization and pruning to yield robust, compressed models.
  • It formulates specific optimization objectives for quantization noise and pruning mask perturbations, resulting in flatter loss landscapes and improved certified robustness.
  • Empirical results demonstrate that C-SAM enhances model accuracy and stability, outperforming standard training methods on various deep network architectures.

Compression-aware Sharpness Minimization (C-SAM) encompasses a class of training objectives and algorithms that jointly optimize neural network compactness and robust generalization by explicitly minimizing the sensitivity of the loss landscape with respect to structured compression transformations, such as quantization or pruning. In contrast to traditional sharpness-aware training—which targets flatness under infinitesimal weight perturbations—C-SAM extends this principle to discrete, often non-differentiable, model modifications critical for on-device deployment. There are two principal realizations: (1) for quantized networks, C-SAM integrates sharpness-aware minimization in the quantization noise domain (Liu et al., 2021); (2) for pruned networks, C-SAM directly regularizes flatness under pruning mask perturbations (He et al., 28 Jan 2026). Both approaches yield compressed subnetworks characterized by flatter, more stable loss minima and empirically substantiated robustness improvements.

1. Motivation and Problem Statement

Standard Sharpness-Aware Minimization (SAM) introduces adversarial weight perturbations during training to seek flat minima, enhancing resistance to small parameter fluctuations and improving generalization. However, these methods operate exclusively in the continuous weight space and exhibit fundamental limitations when coupled with compression techniques:

  • Quantization: The abrupt discretization imposed by uniform quantization transforms full-precision weights into low-bit representations, generating quantization noise $\epsilon_q(w) = Q_w(w) - w$. Standard SAM does not account for this non-additive, structured noise.
  • Pruning: Structural pruning removes weights or channels entirely via binary masks, resulting in discrete, high-magnitude changes in the network architecture. SAM-trained models, although robust to small perturbations, may collapse under mask flips, because flatness in the weight space does not guarantee flatness under structural perturbations.

Compression-aware Sharpness Minimization generalizes SAM to the compression domain, introducing optimization objectives and regularizers that enforce flatness directly against quantization or structural (mask) perturbations, thereby aligning training robustness with eventual deployment constraints (Liu et al., 2021, He et al., 28 Jan 2026).

2. Mathematical Formalism and Optimization Objectives

2.1 Quantization-aware C-SAM

Given weights $w \in \mathbb{R}^d$, learnable clipping scales $\alpha_w$, $\alpha_x$, and a $b$-bit uniform quantizer:

$$Q_w(w) = \alpha_w \left( 2s \left\lfloor \frac{w/\alpha_w + 1}{2s} \right\rceil - 1 \right), \qquad s = \frac{1}{2^b - 1}$$
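As a concrete illustration, the following is a minimal PyTorch-style sketch of such a $b$-bit uniform quantizer with a straight-through estimator (STE) for the rounding operation; the function name and the simplified handling of the clipping scale are illustrative assumptions, not the cited paper's implementation.

```python
import torch

def uniform_quantize(w: torch.Tensor, alpha: torch.Tensor, b: int) -> torch.Tensor:
    """b-bit uniform quantizer: Q_w(w) = alpha * (2s * round((w/alpha + 1)/(2s)) - 1),
    with s = 1/(2^b - 1) and a straight-through estimator for round()."""
    s = 1.0 / (2 ** b - 1)
    x = torch.clamp(w / alpha, -1.0, 1.0)      # restrict to the clipping range
    k = torch.round((x + 1.0) / (2.0 * s))     # integer levels 0 .. 2^b - 1
    q = alpha * (2.0 * s * k - 1.0)            # map back to [-alpha, alpha]
    # STE: forward pass returns q, backward pass treats the quantizer as identity in w
    # (gradient handling for the learnable scale alpha, as in LSQ-style methods, is omitted)
    return w + (q - w).detach()
```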

C-SAM frames quantization as stochastic additive noise:

$$\min_{w,\,\alpha_w,\,\alpha_x}\ \max_{\|\epsilon_s\|_2 \leq \rho}\ \frac{1}{n} \sum_i \ell\left( w + \epsilon_q(w) + \epsilon_s;\ (x_i, y_i) \right)$$

The preferred practical formulation is “quantize-then-SAM”, defining the perturbed quantized loss:

$$L_\text{PQL}(w) = \frac{1}{n}\sum_i \ell\left(Q_w(w) + \hat{\epsilon}_s(Q_w(w));\ (x_i, y_i)\right)$$

where $\hat{\epsilon}_s(Q_w(w)) = \rho\, \frac{\nabla_{Q_w(w)} L_\text{VQL}}{\|\nabla_{Q_w(w)} L_\text{VQL}\|_2}$ and $L_\text{VQL}$ denotes the vanilla quantized loss, i.e., the unperturbed loss evaluated at $Q_w(w)$.
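To make the quantize-then-SAM step concrete, here is a minimal sketch of computing $L_\text{PQL}$ in PyTorch under simplifying assumptions: `forward(x, w)` is any function mapping inputs and (quantized) weights to logits, weights are handled as a single flat tensor, and routing the resulting gradients back to the full-precision weights (straight-through) is left to the caller. This is illustrative, not the paper's exact training loop.

```python
import torch
import torch.nn.functional as F

def pql_loss(forward, w_q: torch.Tensor, x: torch.Tensor, y: torch.Tensor, rho: float) -> torch.Tensor:
    """Perturbed quantized loss: evaluate the loss at Q_w(w) + eps_hat, where
    eps_hat = rho * grad L_VQL / ||grad L_VQL||_2 is taken w.r.t. the quantized weights."""
    # treat the quantized weights as a local leaf so we can take a gradient w.r.t. them
    w_q = w_q.detach().requires_grad_(True)
    vql = F.cross_entropy(forward(x, w_q), y)               # vanilla quantized loss L_VQL
    (grad,) = torch.autograd.grad(vql, w_q)
    eps_hat = rho * grad / (grad.norm() + 1e-12)            # SAM ascent direction, radius rho
    return F.cross_entropy(forward(x, w_q + eps_hat), y)    # perturbed quantized loss L_PQL

# Illustrative usage with a linear classifier:
#   forward = lambda x, w: x @ w
#   loss = pql_loss(forward, uniform_quantize(w, alpha, b=4), x, y, rho=0.05)
```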

2.2 Pruning-aware C-SAM

Let $C \in [0,1]^n$ denote a soft mask over prunable units; $\theta$ are the model parameters, and $T(C)$ broadcasts the mask to match the shape of $\theta$:

  • Mask perturbation: $C_\xi = \operatorname{clip}(C + \xi,\, 0,\, 1)$, with $\xi \sim \mathrm{Uniform}(-\mu, \mu)$
  • Network output: $p_C(x) = f\big(x;\ T(C) \odot \theta\big)$

The composite loss comprises:

  • Stability loss: $L_\text{stab} = \mathbb{E}_x\, \mathbb{E}_{\xi_1, \xi_2} \left[ \| p_{C_{\xi_1}}(x) - p_{C_{\xi_2}}(x) \|_2^2 \right]$
  • Ratio (margin) loss: $L_\text{ratio}$ leverages the classification margin under mask and semantic transformations
  • Consistency loss: $L_\text{consis}$ penalizes the KL divergence between outputs from soft mask perturbations and binarized mask projections
  • An $L_1$ sparsity regularizer on the mask

The aggregated training objective is:

$$L(C) = \lambda_\text{stab} L_\text{stab} + \lambda_\text{ratio} L_\text{ratio} + \lambda_\text{consis} L_\text{consis} + \lambda_1 \|C\|_1$$
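A minimal sketch of the mask perturbation and the stability term in PyTorch, assuming the mask is already shaped like the parameter tensor (i.e., $T$ is the identity broadcast) and `forward(x, masked_theta)` returns logits; the function names are illustrative, not the paper's code.

```python
import torch

def perturb_mask(C: torch.Tensor, mu: float) -> torch.Tensor:
    """Soft-mask perturbation: C_xi = clip(C + xi, 0, 1) with xi ~ Uniform(-mu, mu)."""
    xi = torch.empty_like(C).uniform_(-mu, mu)
    return torch.clamp(C + xi, 0.0, 1.0)

def stability_loss(forward, theta: torch.Tensor, C: torch.Tensor,
                   x: torch.Tensor, mu: float) -> torch.Tensor:
    """L_stab estimate: squared output discrepancy under two independent mask draws."""
    p1 = forward(x, perturb_mask(C, mu) * theta)
    p2 = forward(x, perturb_mask(C, mu) * theta)
    return ((p1 - p2) ** 2).sum(dim=-1).mean()   # Monte Carlo estimate of E ||p1 - p2||_2^2
```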

3. Algorithmic Implementation

3.1 Quantization-aware C-SAM (Look-Style)

The quantize-then-SAM C-SAM algorithm employs a low-overhead “Look-style” outer step:

  • Compute the gradient of the vanilla quantized loss $L_\text{VQL}$
  • Every $\tau$ steps, recompute the expensive gradient of the perturbed quantized loss $L_\text{PQL}$; decompose it into components parallel and orthogonal to $\nabla_w L_\text{VQL}$
  • In between, update using a combination of $\nabla_w L_\text{VQL}$ and the last stored orthogonal component

This yields a training cost of at most $1 + 1/\tau$ times that of standard SGD/AdamW, with an empirical wall-clock overhead of $\lesssim 1.11\times$ for $\tau = 4$.
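A minimal sketch of this schedule, assuming gradients are handled as flat 1-D tensors and the caller supplies a closure for the expensive $\nabla L_\text{PQL}$ computation; the decomposition follows the description above, but the function names and state handling are illustrative.

```python
import torch

def decompose(g_pql: torch.Tensor, g_vql: torch.Tensor):
    """Split grad L_PQL into components parallel and orthogonal to grad L_VQL."""
    coef = torch.dot(g_pql, g_vql) / (g_vql.norm() ** 2 + 1e-12)
    parallel = coef * g_vql
    return parallel, g_pql - parallel

def look_style_gradient(step: int, tau: int, g_vql: torch.Tensor,
                        compute_g_pql, state: dict) -> torch.Tensor:
    """Every tau steps, refresh the expensive grad L_PQL and cache its orthogonal part;
    in between, reuse the cached component alongside the cheap grad L_VQL."""
    if step % tau == 0:
        g_pql = compute_g_pql()                      # extra forward/backward pass
        _, state["orth"] = decompose(g_pql, g_vql)
        return g_pql
    return g_vql + state.get("orth", torch.zeros_like(g_vql))
```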

3.2 Pruning-aware C-SAM

C-SAM for pruning comprises three stages:

  • Pre-training: Train full model with cross-entropy over both clean and semantically transformed examples
  • Robust mask search: Freeze weights, optimize the mask $C$ with the perturbation-driven loss described above; binarize using a top-$k$ STE at the desired sparsity; update $C$ using the total loss $L(C)$
  • Post-training: Re-binarize $C$ and fine-tune the pruned model

Key hyperparameters include the mask noise magnitude $\mu$, the safety margin $\eta$, the loss weights $\lambda$, and the percentile-based mask initialization $\tau$.
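The top-$k$ straight-through binarization used during mask search could be sketched as follows; the thresholding rule and function name are illustrative assumptions rather than the paper's exact procedure.

```python
import torch

def topk_binarize_ste(C: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Project the soft mask C to a binary mask keeping the largest (1 - sparsity)
    fraction of entries, with a straight-through estimator so gradients still reach C."""
    k = max(1, int(round((1.0 - sparsity) * C.numel())))
    threshold = torch.topk(C.flatten(), k).values.min()   # k-th largest mask value
    hard = (C >= threshold).to(C.dtype)
    return C + (hard - C).detach()    # forward: hard mask; backward: identity w.r.t. C
```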

4. Theoretical Properties

  • For quantized networks, minimizing $L_\text{VQL} + L_\text{PQL}$ simultaneously achieves low empirical loss and local flatness, as controlled by the SAM radius $\rho$. Empirical measures show a pronounced reduction in the maximum Hessian eigenvalue $\lambda_{\max}$, e.g., from $\approx 71$ (SGD) to $\approx 6.3$ (C-SAM) for 4-bit ResNet-50 on ImageNet, indicating a significantly flattened landscape (Liu et al., 2021); a sketch of how such an eigenvalue estimate can be obtained follows this list.
  • For mask-space C-SAM, minimizing $L_\text{stab}$ directly reduces prediction variance under small mask perturbations, thereby tightening an explicit upper bound on the semantic prediction discrepancy $Z_C(x;T)$. The label-invariance lemma ($Z < d \implies \operatorname{argmax} p(x) = \operatorname{argmax} p(T(x))$) provides a sufficient condition for certified robustness (He et al., 28 Jan 2026).
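For reference, the largest Hessian eigenvalue can be estimated with power iteration on Hessian-vector products; the sketch below is a generic flatness probe following that standard recipe, not necessarily the measurement protocol used in the cited papers.

```python
import torch

def hessian_lambda_max(loss_fn, params, iters: int = 20) -> float:
    """Estimate the top Hessian eigenvalue via power iteration on Hessian-vector products."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_g = torch.cat([g.flatten() for g in grads])
    v = torch.randn_like(flat_g)
    v /= v.norm()
    lam = 0.0
    for _ in range(iters):
        hv = torch.autograd.grad(flat_g @ v, params, retain_graph=True)  # Hessian-vector product
        hv = torch.cat([h.flatten() for h in hv])
        lam = float(hv @ v)               # Rayleigh quotient with the current unit vector v
        v = hv / (hv.norm() + 1e-12)      # power-iteration update
    return lam
```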

5. Empirical Results

5.1 Quantization-aware C-SAM

Empirical evaluations on standard vision backbones (ResNet-18/34/50, ViT, MobileNetV2) and datasets (ImageNet, CIFAR-10/100, Flowers-102, Oxford-IIIT Pets) demonstrate:

Method (bit-width W/A) | Top-1 ImageNet (%) | $\lambda_{\max}$
SGD (LSQ), 4/4         | 76.7               | 71.8
C-SAM, 4/4             | 77.6 (+0.9)        | 6.3

For ViT-B/16 (4-bit), Top-1 accuracy improved from 78.0% (LSQ+ AdamW) to 79.2% (C-SAM). For MobileNetV2 (ImageNet, 4/4), C-SAM outperformed PROFIT by 0.4%. Transfer learning from 4-bit ResNet-50 showed consistent accuracy improvements across new datasets up to +1% Top-1 (Liu et al., 2021).

5.2 Pruning-aware C-SAM

For unstructured pruning at 50% sparsity (ResNet-18):

Method  | CelebA-PCA (%) | Flowers-PCA (%) | CIFAR-PCA (%)
HYDRA   | 26.0           | 63.0            | 53.0
S²-SAM  | 58.0           | 72.0            | 33.0
C-SAM   | 74.0           | 88.0            | 60.0

For structured pruning, C-SAM achieved 66.0% (CelebA-PCA), 83.0% (Flowers-PCA), and 73.0% (CIFAR-PCA), outperforming the DepGraph and HESSO baselines. Robustness gains of up to +42% PCA were observed over prior methods on GoogLeNet at a 50% pruning ratio. On some tasks, C-SAM-pruned models surpassed even unpruned vanilla models in certified robustness (He et al., 28 Jan 2026).

6. Ablations, Limitations, and Practical Recommendations

  • Ablation experiments confirm that each loss component contributes to robustness (e.g., removing $L_\text{stab}$ reduces CelebA PCA from 74% to 68%).
  • Hyperparameter sensitivity is moderate: mask noise $\mu \approx 0.5$ and safety margin $\eta \approx 1.0$ are optimal in most tested scenarios; soft-mask percentile initialization at $\tau = 30\%$ yields better mask diversity than random or extreme percentile choices.
  • C-SAM trains at only modest extra cost over standard pipelines: for quantization, $\lesssim 15\%$ overhead; for pruning, efficiency depends on the mask update schedule.
  • “SAM-before-pruning” and “SAM-after-pruning” fail to produce robust subnetworks because the pruning mask is either fixed or not adapted for stability; C-SAM, by contrast, integrates subnetwork discovery with sharpness-aware learning, yielding subnetworks that are simultaneously compact and certifiably robust.

7. Significance and Research Impact

Compression-aware Sharpness Minimization establishes a unified, algorithmically efficient framework for compression-robust training in deep networks. Distinctive features include: (1) recasting quantization/pruning as structured perturbations for sharpness objectives, (2) precise loss landscape regularization in relevant compressed subspaces, and (3) consistent improvements over prior SOTA in both generalization and certified robustness at high compression rates. C-SAM is of direct relevance to practitioners deploying DNNs under tight memory and inference constraints, particularly for on-device and edge computing scenarios (Liu et al., 2021, He et al., 28 Jan 2026).
