C-SAM: Compression-aware Sharpness Minimization
- The paper introduces C-SAM, extending traditional SAM by addressing discrete perturbations from quantization and pruning to yield robust, compressed models.
- It formulates specific optimization objectives for quantization noise and pruning mask perturbations, resulting in flatter loss landscapes and improved certified robustness.
- Empirical results demonstrate that C-SAM enhances model accuracy and stability, outperforming standard training methods on various deep network architectures.
Compression-aware Sharpness Minimization (C-SAM) encompasses a class of training objectives and algorithms that jointly optimize neural network compactness and robust generalization by explicitly minimizing the sensitivity of the loss landscape with respect to structured compression transformations, such as quantization or pruning. In contrast to traditional sharpness-aware training—which targets flatness under infinitesimal weight perturbations—C-SAM extends this principle to discrete, often non-differentiable, model modifications critical for on-device deployment. There are two principal realizations: (1) for quantized networks, C-SAM integrates sharpness-aware minimization in the quantization noise domain (Liu et al., 2021); (2) for pruned networks, C-SAM directly regularizes flatness under pruning mask perturbations (He et al., 28 Jan 2026). Both approaches yield compressed subnetworks characterized by flatter, more stable loss minima and empirically substantiated robustness improvements.
1. Motivation and Problem Statement
Standard Sharpness-Aware Minimization (SAM) introduces adversarial weight perturbations during training to seek flat minima, enhancing resistance to small parameter fluctuations and improving generalization. However, these methods operate exclusively in the continuous weight space and exhibit fundamental limitations when coupled with compression techniques:
- Quantization: The abrupt discretization imposed by uniform quantization transforms full-precision weights into low-bit representations, generating quantization noise $\epsilon_q$ (the gap between the full-precision and quantized weights). Standard SAM does not account for this structured, weight-dependent noise.
- Pruning: Structural pruning removes weights or channels entirely via binary masks, resulting in discrete, high-magnitude changes to the network architecture. SAM-trained models, although robust to small perturbations, may collapse under mask flips, as flatness in the continuous weight space does not guarantee flatness with respect to structural perturbations.
Compression-aware Sharpness Minimization generalizes SAM to the compression domain, introducing optimization objectives and regularizers that enforce flatness directly against quantization or structural (mask) perturbations, thereby aligning training robustness with eventual deployment constraints (Liu et al., 2021, He et al., 28 Jan 2026).
2. Mathematical Formalism and Optimization Objectives
2.1 Quantization-aware C-SAM
Given weights $w$, learnable clipping scales $\alpha$ and $\beta$, and $b$-bit uniform quantizers $Q_b(\cdot)$, C-SAM frames quantization as stochastic additive noise, $Q_b(w) = w + \epsilon_q$, where $\epsilon_q$ is the (weight-dependent) quantization noise. The preferred practical formulation is “quantize-then-SAM”, defining the perturbed quantized loss

$$\mathcal{L}_Q^{\mathrm{SAM}}(w) = \max_{\lVert\epsilon\rVert_2 \le \rho} \mathcal{L}\big(Q_b(w) + \epsilon\big),$$

where $\rho$ is the sharpness radius and the inner maximization is approximated by the standard first-order ascent step $\hat{\epsilon} = \rho\,\nabla\mathcal{L}(Q_b(w)) / \lVert\nabla\mathcal{L}(Q_b(w))\rVert_2$.
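This objective can be approximated with a standard two-pass SAM update taken at the quantized point. The following is a minimal PyTorch-style sketch, assuming a generic fake-quantizer `quantize` (e.g., $b$-bit uniform with straight-through gradients) and hypothetical `model`, `loss_fn`, and `optimizer` objects; it illustrates the quantize-then-SAM structure rather than either paper's released implementation.

```python
import torch


def quantize_then_sam_step(model, quantize, loss_fn, batch, optimizer, rho=0.05):
    """Sketch of one quantize-then-SAM update (illustrative, not reference code).

    `quantize(w)` is assumed to be a fake-quantizer returning b-bit uniform
    quantized values; `rho` is the sharpness radius of the inner maximization.
    """
    inputs, targets = batch
    optimizer.zero_grad()

    # First pass: gradient of the loss evaluated at the quantized weights Q(w).
    originals = [p.detach().clone() for p in model.parameters()]
    with torch.no_grad():
        for p in model.parameters():
            p.copy_(quantize(p))
    loss_q = loss_fn(model(inputs), targets)
    loss_q.backward()

    # Ascent step: eps_hat = rho * g / ||g||, taken at Q(w).
    grads = [p.grad.detach().clone() for p in model.parameters()]
    grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)).clamp_min(1e-12)
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(g * (rho / grad_norm))  # move to Q(w) + eps_hat
            p.grad = None

    # Second pass: gradient of the perturbed quantized loss L(Q(w) + eps_hat).
    loss_perturbed = loss_fn(model(inputs), targets)
    loss_perturbed.backward()

    # Restore full-precision weights and step with the perturbed-point gradient
    # (a straight-through treatment of the quantizer is implicitly assumed).
    with torch.no_grad():
        for p, w in zip(model.parameters(), originals):
            p.copy_(w)
    optimizer.step()
    return loss_perturbed.item()
```

Two forward/backward passes per step are required here; the Look-style scheme in Section 3.1 amortizes the second pass over several iterations.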
2.2 Pruning-aware C-SAM
Let $m \in [0,1]^k$ denote a soft mask over prunable units, let $\theta$ be the model parameters, and let $B(\cdot)$ broadcast the mask to match the shape of $\theta$:
- Mask perturbation: $\tilde{m} = m + \delta$, with noise bounded by the mask noise magnitude, $\lVert\delta\rVert_\infty \le \epsilon$
- Network output: $f(x;\, \theta \odot B(\tilde{m}))$
The composite loss comprises:
- Stability loss $\mathcal{L}_{\mathrm{stab}}$: penalizes the discrepancy between predictions under the clean mask $m$ and the perturbed mask $\tilde{m}$
- Ratio (margin) loss $\mathcal{L}_{\mathrm{ratio}}$: leverages the classification margin under mask and semantic transformations
- Consistency loss $\mathcal{L}_{\mathrm{cons}}$: penalizes the KL divergence between outputs from soft mask perturbations and binarized mask projections
- Sparsity regularizer $\mathcal{R}_{\mathrm{sparse}}$: a norm penalty on the soft mask that drives it toward the target sparsity
The aggregated training objective is a weighted combination of these terms,

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{stab}} + \lambda_{\mathrm{r}}\,\mathcal{L}_{\mathrm{ratio}} + \lambda_{\mathrm{c}}\,\mathcal{L}_{\mathrm{cons}} + \lambda_{\mathrm{s}}\,\mathcal{R}_{\mathrm{sparse}},$$

with loss weights $\lambda_{\mathrm{r}}, \lambda_{\mathrm{c}}, \lambda_{\mathrm{s}}$; an illustrative sketch of this composite objective follows.
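Because the exact loss expressions are not reproduced here, the sketch below is an illustrative reconstruction of the composite objective in PyTorch. It assumes a hypothetical `apply_mask(model, mask, inputs)` helper that runs the network with the mask broadcast onto its prunable units; the specific loss forms, the 0.5 binarization threshold, and the default weights are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F


def composite_mask_loss(model, soft_mask, apply_mask, batch,
                        noise_eps=0.1, margin=0.1,
                        w_ratio=1.0, w_cons=1.0, w_sparse=1e-4):
    """Illustrative reconstruction of the pruning-aware C-SAM objective.

    `apply_mask(model, mask, inputs)` is a hypothetical helper returning the
    logits of the network with `mask` broadcast onto its prunable units.
    """
    inputs, targets = batch

    # Clean soft-mask logits and logits under a bounded mask perturbation.
    logits_clean = apply_mask(model, soft_mask, inputs)
    delta = (torch.rand_like(soft_mask) * 2 - 1) * noise_eps
    perturbed_mask = (soft_mask + delta).clamp(0.0, 1.0)
    logits_pert = apply_mask(model, perturbed_mask, inputs)

    # Stability loss: penalize prediction drift under mask perturbations.
    loss_stab = F.mse_loss(logits_pert, logits_clean)

    # Ratio (margin) loss: keep the true-class margin above a safety margin.
    true_logit = logits_pert.gather(1, targets[:, None]).squeeze(1)
    runner_up = logits_pert.scatter(1, targets[:, None], float("-inf")).max(dim=1).values
    loss_ratio = F.relu(margin - (true_logit - runner_up)).mean()

    # Consistency loss: KL between soft-mask and binarized-mask predictions
    # (the paper uses a top-k projection at the target sparsity; a threshold
    # is used here for brevity).
    hard_mask = (soft_mask > 0.5).float()
    logits_hard = apply_mask(model, hard_mask, inputs)
    loss_cons = F.kl_div(F.log_softmax(logits_hard, dim=1),
                         F.softmax(logits_clean, dim=1), reduction="batchmean")

    # Sparsity regularizer on the soft mask.
    loss_sparse = soft_mask.abs().mean()

    return loss_stab + w_ratio * loss_ratio + w_cons * loss_cons + w_sparse * loss_sparse
```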
3. Algorithmic Implementation
3.1 Quantization-aware C-SAM (Look-Style)
The quantize-then-SAM C-SAM algorithm employs a low-overhead “Look-style” outer step:
- At every step, compute the gradient of the vanilla quantized loss, $g = \nabla_w \mathcal{L}(Q_b(w))$
- Every $k$ steps, recompute the expensive gradient of the perturbed quantized loss, $g_{\mathrm{SAM}} = \nabla_w \mathcal{L}(Q_b(w) + \hat{\epsilon})$; decompose $g_{\mathrm{SAM}}$ into components parallel and orthogonal to $g$
- In between, update the weights using a combination of the current $g$ and the last stored orthogonal component
This keeps the training cost only modestly above standard SGD/AdamW, well below the roughly 2x cost of recomputing the SAM gradient at every step, with a small empirical wall-clock overhead for moderate reuse intervals $k$.
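A minimal sketch of the decomposition and reuse rule on flattened gradient vectors is given below; the reuse coefficient `alpha` and the exact rescaling are assumptions in the spirit of look-ahead-style SAM variants, not the paper's precise update.

```python
from typing import Optional

import torch


def decompose(g_sam: torch.Tensor, g: torch.Tensor):
    """Split the perturbed-loss gradient into components parallel/orthogonal to g."""
    parallel = (g_sam @ g) / (g @ g + 1e-12) * g
    return parallel, g_sam - parallel


def look_style_gradient(step: int, k: int, g: torch.Tensor,
                        g_sam: Optional[torch.Tensor], state: dict,
                        alpha: float = 0.7) -> torch.Tensor:
    """Return the update direction for the current step (flattened gradients).

    Every k steps the caller supplies the expensive perturbed-quantized-loss
    gradient `g_sam`; its orthogonal component is cached in `state` and reused,
    rescaled to the current gradient norm, on the intermediate steps. The reuse
    coefficient `alpha` is an assumed hyperparameter.
    """
    if step % k == 0:
        _, g_orth = decompose(g_sam, g)
        state["g_orth"] = g_orth
        return g_sam
    g_orth = state.get("g_orth")
    if g_orth is None:          # before the first refresh, fall back to plain g
        return g
    scale = g.norm() / (g_orth.norm() + 1e-12)
    return g + alpha * scale * g_orth
```

Here `g` and `g_sam` are the model gradients flattened into single vectors; in practice the same decomposition can be applied per parameter tensor.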
3.2 Pruning-aware C-SAM
C-SAM for pruning comprises three stages:
- Pre-training: Train full model with cross-entropy over both clean and semantically transformed examples
- Robust mask search: Freeze weights, optimize the soft mask with the perturbation-driven losses described above; binarize using a top-$k$ STE at the desired sparsity (see the sketch after this subsection); update the mask using the total loss $\mathcal{L}_{\mathrm{total}}$
- Post-training: Re-binarize and fine-tune the pruned model
Key hyperparameters include the mask noise magnitude, the safety margin, the loss-term weights, and the percentile used for soft-mask initialization.
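The top-$k$ straight-through binarization used in the mask-search stage can be implemented as a custom autograd function; the following is a minimal sketch, assuming the soft mask is a flat tensor of per-unit scores (the exact STE variant in the paper may differ).

```python
import torch


class TopKBinarize(torch.autograd.Function):
    """Top-k straight-through estimator: hard top-k mask in the forward pass,
    identity gradient in the backward pass so the soft mask stays trainable."""

    @staticmethod
    def forward(ctx, scores: torch.Tensor, sparsity: float) -> torch.Tensor:
        k = max(1, int(round(scores.numel() * (1.0 - sparsity))))  # units kept
        mask = torch.zeros_like(scores)
        topk_idx = scores.flatten().topk(k).indices
        mask.view(-1)[topk_idx] = 1.0
        return mask

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: pass gradients to the soft scores unchanged.
        return grad_output, None


# Example: keep the top 50% of units (50% sparsity).
scores = torch.randn(8, requires_grad=True)
hard_mask = TopKBinarize.apply(scores, 0.5)
hard_mask.sum().backward()   # gradients flow to `scores` via the STE
```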
4. Theoretical Properties
- For quantized networks, minimizing $\mathcal{L}_Q^{\mathrm{SAM}}$ simultaneously achieves low empirical loss and local flatness, as controlled by the SAM radius $\rho$. Empirical measures show a pronounced reduction in the maximum Hessian eigenvalue $\lambda_{\max}$, e.g., from 71.8 (SGD) to 6.3 (C-SAM) for 4-bit ResNet-50 on ImageNet, indicating a significantly flattened landscape (Liu et al., 2021); a generic estimation sketch for $\lambda_{\max}$ follows this list.
- For mask-space C-SAM, minimizing the stability loss directly reduces prediction variance under small mask perturbations, thereby tightening an explicit upper bound on the semantic prediction discrepancy. A label-invariance lemma provides a sufficient condition for certified robustness (He et al., 28 Jan 2026).
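Maximum Hessian eigenvalues of this kind are typically estimated by power iteration on Hessian-vector products; the sketch below is a generic estimator of that form (not either paper's evaluation code), where `loss_fn` is a closure computing the loss of the quantized or pruned model on a fixed batch and `params` are the trainable tensors of interest.

```python
import torch


def max_hessian_eigenvalue(loss_fn, params, num_iters=20):
    """Estimate lambda_max of the loss Hessian via power iteration on
    Hessian-vector products; `params` must have requires_grad=True."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Random unit start vector (one tensor per parameter).
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x * x).sum() for x in v))
    v = [x / norm for x in v]

    eigenvalue = torch.tensor(0.0)
    for _ in range(num_iters):
        # Hessian-vector product via a second backward pass.
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        eigenvalue = sum((h * x).sum() for h, x in zip(hv, v))  # Rayleigh quotient
        norm = torch.sqrt(sum((h * h).sum() for h in hv)) + 1e-12
        v = [h / norm for h in hv]
    return eigenvalue.item()
```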
5. Empirical Results
5.1 Quantization-aware C-SAM
Empirical evaluations on standard vision backbones (ResNet-18/34/50, ViT, MobileNetV2) and datasets (ImageNet, CIFAR-10/100, Flowers-102, Oxford-IIIT Pets) demonstrate:
| Method (Bit-width W/A) | Top-1 ImageNet (%) | Max Hessian eigenvalue $\lambda_{\max}$ |
|---|---|---|
| SGD (LSQ), 4/4 | 76.7 | 71.8 |
| C-SAM, 4/4 | 77.6 (+0.9) | 6.3 |
For ViT-B/16 (4-bit), Top-1 accuracy improved from 78.0% (LSQ+ AdamW) to 79.2% (C-SAM). For MobileNetV2 (ImageNet, 4/4), C-SAM outperformed PROFIT by 0.4%. Transfer learning from a 4-bit ResNet-50 showed consistent accuracy improvements of up to +1% Top-1 across new datasets (Liu et al., 2021).
5.2 Pruning-aware C-SAM
For unstructured pruning at 50% sparsity (ResNet-18):
| Method | CelebA-PCA (%) | Flowers-PCA (%) | CIFAR-PCA (%) |
|---|---|---|---|
| HYDRA | 26.0 | 63.0 | 53.0 |
| S²-SAM | 58.0 | 72.0 | 33.0 |
| C-SAM | 74.0 | 88.0 | 60.0 |
For structured pruning, C-SAM achieved 66.0% (CelebA-PCA), 83.0% (Flowers-PCA), and 73.0% (CIFAR-PCA), outperforming the DepGraph and HESSO baselines. Robustness gains of up to +42% PCA were observed over prior methods on GoogLeNet at a 50% pruning ratio. On some tasks, C-SAM-pruned models surpassed even unpruned vanilla models in certified robustness (He et al., 28 Jan 2026).
6. Ablations, Limitations, and Practical Recommendations
- Ablation experiments confirm that each loss component contributes to robustness (e.g., removing one of the components reduces CelebA PCA from 74% to 68%).
- Hyperparameter sensitivity is moderate: a single setting of the mask noise magnitude and safety margin works well across most tested scenarios, and moderate-percentile soft-mask initialization yields better mask diversity than random or extreme-percentile choices.
- C-SAM trains at only modest extra cost over standard pipelines: for quantization, the Look-style gradient reuse keeps the overhead small; for pruning, efficiency depends on the mask update schedule.
- “SAM-before-pruning” and “SAM-after-pruning” baselines fail to produce robust subnetworks because their pruning masks are either fixed or not adapted for stability; C-SAM, by contrast, integrates subnetwork discovery with sharpness-aware learning, yielding subnetworks that are simultaneously compact and certifiably robust.
7. Significance and Research Impact
Compression-aware Sharpness Minimization establishes a unified, algorithmically efficient framework for compression-robust training in deep networks. Distinctive features include: (1) recasting quantization/pruning as structured perturbations for sharpness objectives, (2) precise loss landscape regularization in relevant compressed subspaces, and (3) consistent improvements over prior SOTA in both generalization and certified robustness at high compression rates. C-SAM is of direct relevance to practitioners deploying DNNs under tight memory and inference constraints, particularly for on-device and edge computing scenarios (Liu et al., 2021, He et al., 28 Jan 2026).