C-SAM: Compression-aware Sharpness Minimization
- The paper introduces C-SAM, extending traditional SAM by addressing discrete perturbations from quantization and pruning to yield robust, compressed models.
- It formulates specific optimization objectives for quantization noise and pruning mask perturbations, resulting in flatter loss landscapes and improved certified robustness.
- Empirical results demonstrate that C-SAM enhances model accuracy and stability, outperforming standard training methods on various deep network architectures.
Compression-aware Sharpness Minimization (C-SAM) encompasses a class of training objectives and algorithms that jointly optimize neural network compactness and robust generalization by explicitly minimizing the sensitivity of the loss landscape with respect to structured compression transformations, such as quantization or pruning. In contrast to traditional sharpness-aware training—which targets flatness under infinitesimal weight perturbations—C-SAM extends this principle to discrete, often non-differentiable, model modifications critical for on-device deployment. There are two principal realizations: (1) for quantized networks, C-SAM integrates sharpness-aware minimization in the quantization noise domain (Liu et al., 2021); (2) for pruned networks, C-SAM directly regularizes flatness under pruning mask perturbations (He et al., 28 Jan 2026). Both approaches yield compressed subnetworks characterized by flatter, more stable loss minima and empirically substantiated robustness improvements.
1. Motivation and Problem Statement
Standard Sharpness-Aware Minimization (SAM) introduces adversarial weight perturbations during training to seek flat minima, enhancing resistance to small parameter fluctuations and improving generalization. However, these methods operate exclusively in the continuous weight space and exhibit fundamental limitations when coupled with compression techniques:
- Quantization: The abrupt discretization imposed by uniform quantization transforms full-precision weights into low-bit representations, generating quantization noise $\epsilon_q$ (the gap between the full-precision and quantized weights). Standard SAM does not account for this structured, weight-dependent noise.
- Pruning: Structural pruning removes weights or channels entirely via binary masks, resulting in discrete, high-magnitude changes to the network architecture. SAM-trained models, although robust to small perturbations, may collapse under mask flips, as flatness in the continuous weight space does not guarantee flatness with respect to structural perturbations.
Compression-aware Sharpness Minimization generalizes SAM to the compression domain, introducing optimization objectives and regularizers that enforce flatness directly against quantization or structural (mask) perturbations, thereby aligning training robustness with eventual deployment constraints (Liu et al., 2021, He et al., 28 Jan 2026).
2. Mathematical Formalism and Optimization Objectives
2.1 Quantization-aware C-SAM
Given weights $w$, learnable clipping scales $\alpha$ and $\beta$, and $b$-bit uniform quantizers $Q_b(\cdot)$, C-SAM frames quantization as stochastic additive noise, $Q_b(w) = w + \epsilon_q$, where $\epsilon_q$ is the (weight-dependent) quantization noise. The preferred practical formulation is “quantize-then-SAM”, defining the perturbed quantized loss

$$\mathcal{L}_Q^{\mathrm{SAM}}(w) = \max_{\lVert\epsilon\rVert_2 \le \rho} \mathcal{L}\big(Q_b(w) + \epsilon\big),$$

where $\rho$ is the sharpness radius and the inner maximization is approximated by the standard first-order ascent step $\hat{\epsilon} = \rho\,\nabla\mathcal{L}(Q_b(w)) / \lVert\nabla\mathcal{L}(Q_b(w))\rVert_2$.
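This objective can be approximated with a standard two-pass SAM update taken at the quantized point. The following is a minimal PyTorch-style sketch, assuming a generic fake-quantizer `quantize` (e.g., $b$-bit uniform with straight-through gradients) and hypothetical `model`, `loss_fn`, and `optimizer` objects; it illustrates the quantize-then-SAM structure rather than either paper's released implementation.

```python
import torch


def quantize_then_sam_step(model, quantize, loss_fn, batch, optimizer, rho=0.05):
    """Sketch of one quantize-then-SAM update (illustrative, not reference code).

    `quantize(w)` is assumed to be a fake-quantizer returning b-bit uniform
    quantized values; `rho` is the sharpness radius of the inner maximization.
    """
    inputs, targets = batch
    optimizer.zero_grad()

    # First pass: gradient of the loss evaluated at the quantized weights Q(w).
    originals = [p.detach().clone() for p in model.parameters()]
    with torch.no_grad():
        for p in model.parameters():
            p.copy_(quantize(p))
    loss_q = loss_fn(model(inputs), targets)
    loss_q.backward()

    # Ascent step: eps_hat = rho * g / ||g||, taken at Q(w).
    grads = [p.grad.detach().clone() for p in model.parameters()]
    grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)).clamp_min(1e-12)
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(g * (rho / grad_norm))  # move to Q(w) + eps_hat
            p.grad = None

    # Second pass: gradient of the perturbed quantized loss L(Q(w) + eps_hat).
    loss_perturbed = loss_fn(model(inputs), targets)
    loss_perturbed.backward()

    # Restore full-precision weights and step with the perturbed-point gradient
    # (a straight-through treatment of the quantizer is implicitly assumed).
    with torch.no_grad():
        for p, w in zip(model.parameters(), originals):
            p.copy_(w)
    optimizer.step()
    return loss_perturbed.item()
```

Two forward/backward passes per step are required here; the Look-style scheme in Section 3.1 amortizes the second pass over several iterations.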
2.2 Pruning-aware C-SAM
Let $m \in [0,1]^k$ denote a soft mask over prunable units, let $\theta$ be the model parameters, and let $B(\cdot)$ broadcast the mask to match the shape of $\theta$:
- Mask perturbation: $\tilde{m} = m + \delta$, with noise bounded by the mask noise magnitude, $\lVert\delta\rVert_\infty \le \epsilon$
- Network output: $f(x;\, \theta \odot B(\tilde{m}))$
The composite loss comprises:
- Stability loss $\mathcal{L}_{\mathrm{stab}}$: penalizes the discrepancy between predictions under the clean mask $m$ and the perturbed mask $\tilde{m}$
- Ratio (margin) loss $\mathcal{L}_{\mathrm{ratio}}$: leverages the classification margin under mask and semantic transformations
- Consistency loss $\mathcal{L}_{\mathrm{cons}}$: penalizes the KL divergence between outputs from soft mask perturbations and binarized mask projections
- Sparsity regularizer $\mathcal{R}_{\mathrm{sparse}}$: a norm penalty on the soft mask that drives it toward the target sparsity
The aggregated training objective is a weighted combination of these terms,

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{stab}} + \lambda_{\mathrm{r}}\,\mathcal{L}_{\mathrm{ratio}} + \lambda_{\mathrm{c}}\,\mathcal{L}_{\mathrm{cons}} + \lambda_{\mathrm{s}}\,\mathcal{R}_{\mathrm{sparse}},$$

with loss weights $\lambda_{\mathrm{r}}, \lambda_{\mathrm{c}}, \lambda_{\mathrm{s}}$; an illustrative sketch of this composite objective follows.
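Because the exact loss expressions are not reproduced here, the sketch below is an illustrative reconstruction of the composite objective in PyTorch. It assumes a hypothetical `apply_mask(model, mask, inputs)` helper that runs the network with the mask broadcast onto its prunable units; the specific loss forms, the 0.5 binarization threshold, and the default weights are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F


def composite_mask_loss(model, soft_mask, apply_mask, batch,
                        noise_eps=0.1, margin=0.1,
                        w_ratio=1.0, w_cons=1.0, w_sparse=1e-4):
    """Illustrative reconstruction of the pruning-aware C-SAM objective.

    `apply_mask(model, mask, inputs)` is a hypothetical helper returning the
    logits of the network with `mask` broadcast onto its prunable units.
    """
    inputs, targets = batch

    # Clean soft-mask logits and logits under a bounded mask perturbation.
    logits_clean = apply_mask(model, soft_mask, inputs)
    delta = (torch.rand_like(soft_mask) * 2 - 1) * noise_eps
    perturbed_mask = (soft_mask + delta).clamp(0.0, 1.0)
    logits_pert = apply_mask(model, perturbed_mask, inputs)

    # Stability loss: penalize prediction drift under mask perturbations.
    loss_stab = F.mse_loss(logits_pert, logits_clean)

    # Ratio (margin) loss: keep the true-class margin above a safety margin.
    true_logit = logits_pert.gather(1, targets[:, None]).squeeze(1)
    runner_up = logits_pert.scatter(1, targets[:, None], float("-inf")).max(dim=1).values
    loss_ratio = F.relu(margin - (true_logit - runner_up)).mean()

    # Consistency loss: KL between soft-mask and binarized-mask predictions
    # (the paper uses a top-k projection at the target sparsity; a threshold
    # is used here for brevity).
    hard_mask = (soft_mask > 0.5).float()
    logits_hard = apply_mask(model, hard_mask, inputs)
    loss_cons = F.kl_div(F.log_softmax(logits_hard, dim=1),
                         F.softmax(logits_clean, dim=1), reduction="batchmean")

    # Sparsity regularizer on the soft mask.
    loss_sparse = soft_mask.abs().mean()

    return loss_stab + w_ratio * loss_ratio + w_cons * loss_cons + w_sparse * loss_sparse
```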
3. Algorithmic Implementation
3.1 Quantization-aware C-SAM (Look-Style)
The quantize-then-SAM C-SAM algorithm employs a low-overhead “Look-style” outer step:
- At every step, compute the gradient of the vanilla quantized loss, $g = \nabla_w \mathcal{L}(Q_b(w))$
- Every $k$ steps, recompute the expensive gradient of the perturbed quantized loss, $g_{\mathrm{SAM}} = \nabla_w \mathcal{L}(Q_b(w) + \hat{\epsilon})$; decompose $g_{\mathrm{SAM}}$ into components parallel and orthogonal to $g$
- In between, update the weights using a combination of the current $g$ and the last stored orthogonal component
This keeps the training cost only modestly above standard SGD/AdamW, well below the roughly 2x cost of recomputing the SAM gradient at every step, with a small empirical wall-clock overhead for moderate reuse intervals $k$.
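A minimal sketch of the decomposition and reuse rule on flattened gradient vectors is given below; the reuse coefficient `alpha` and the exact rescaling are assumptions in the spirit of look-ahead-style SAM variants, not the paper's precise update.

```python
from typing import Optional

import torch


def decompose(g_sam: torch.Tensor, g: torch.Tensor):
    """Split the perturbed-loss gradient into components parallel/orthogonal to g."""
    parallel = (g_sam @ g) / (g @ g + 1e-12) * g
    return parallel, g_sam - parallel


def look_style_gradient(step: int, k: int, g: torch.Tensor,
                        g_sam: Optional[torch.Tensor], state: dict,
                        alpha: float = 0.7) -> torch.Tensor:
    """Return the update direction for the current step (flattened gradients).

    Every k steps the caller supplies the expensive perturbed-quantized-loss
    gradient `g_sam`; its orthogonal component is cached in `state` and reused,
    rescaled to the current gradient norm, on the intermediate steps. The reuse
    coefficient `alpha` is an assumed hyperparameter.
    """
    if step % k == 0:
        _, g_orth = decompose(g_sam, g)
        state["g_orth"] = g_orth
        return g_sam
    g_orth = state.get("g_orth")
    if g_orth is None:          # before the first refresh, fall back to plain g
        return g
    scale = g.norm() / (g_orth.norm() + 1e-12)
    return g + alpha * scale * g_orth
```

Here `g` and `g_sam` are the model gradients flattened into single vectors; in practice the same decomposition can be applied per parameter tensor.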
3.2 Pruning-aware C-SAM
C-SAM for pruning comprises three stages:
- Pre-training: Train full model with cross-entropy over both clean and semantically transformed examples
- Robust mask search: Freeze weights, optimize the soft mask with the perturbation-driven losses described above; binarize using a top-$k$ STE at the desired sparsity (see the sketch after this subsection); update the mask using the total loss $\mathcal{L}_{\mathrm{total}}$
- Post-training: Re-binarize and fine-tune the pruned model
Key hyperparameters include the mask noise magnitude, the safety margin, the loss-term weights, and the percentile used for soft-mask initialization.
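The top-$k$ straight-through binarization used in the mask-search stage can be implemented as a custom autograd function; the following is a minimal sketch, assuming the soft mask is a flat tensor of per-unit scores (the exact STE variant in the paper may differ).

```python
import torch


class TopKBinarize(torch.autograd.Function):
    """Top-k straight-through estimator: hard top-k mask in the forward pass,
    identity gradient in the backward pass so the soft mask stays trainable."""

    @staticmethod
    def forward(ctx, scores: torch.Tensor, sparsity: float) -> torch.Tensor:
        k = max(1, int(round(scores.numel() * (1.0 - sparsity))))  # units kept
        mask = torch.zeros_like(scores)
        topk_idx = scores.flatten().topk(k).indices
        mask.view(-1)[topk_idx] = 1.0
        return mask

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: pass gradients to the soft scores unchanged.
        return grad_output, None


# Example: keep the top 50% of units (50% sparsity).
scores = torch.randn(8, requires_grad=True)
hard_mask = TopKBinarize.apply(scores, 0.5)
hard_mask.sum().backward()   # gradients flow to `scores` via the STE
```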
4. Theoretical Properties
- For quantized networks, minimizing $\mathcal{L}_Q^{\mathrm{SAM}}$ simultaneously achieves low empirical loss and local flatness, as controlled by the SAM radius $\rho$. Empirical measures show a pronounced reduction in the maximum Hessian eigenvalue $\lambda_{\max}$, e.g., from 71.8 (SGD) to 6.3 (C-SAM) for 4-bit ResNet-50 on ImageNet, indicating a significantly flattened landscape (Liu et al., 2021); a generic estimation sketch for $\lambda_{\max}$ follows this list.
- For mask-space C-SAM, minimizing the stability loss directly reduces prediction variance under small mask perturbations, thereby tightening an explicit upper bound on the semantic prediction discrepancy. A label-invariance lemma provides a sufficient condition for certified robustness (He et al., 28 Jan 2026).
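Maximum Hessian eigenvalues of this kind are typically estimated by power iteration on Hessian-vector products; the sketch below is a generic estimator of that form (not either paper's evaluation code), where `loss_fn` is a closure computing the loss of the quantized or pruned model on a fixed batch and `params` are the trainable tensors of interest.

```python
import torch


def max_hessian_eigenvalue(loss_fn, params, num_iters=20):
    """Estimate lambda_max of the loss Hessian via power iteration on
    Hessian-vector products; `params` must have requires_grad=True."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Random unit start vector (one tensor per parameter).
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x * x).sum() for x in v))
    v = [x / norm for x in v]

    eigenvalue = torch.tensor(0.0)
    for _ in range(num_iters):
        # Hessian-vector product via a second backward pass.
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        eigenvalue = sum((h * x).sum() for h, x in zip(hv, v))  # Rayleigh quotient
        norm = torch.sqrt(sum((h * h).sum() for h in hv)) + 1e-12
        v = [h / norm for h in hv]
    return eigenvalue.item()
```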
5. Empirical Results
5.1 Quantization-aware C-SAM
Empirical evaluations on standard vision backbones (ResNet-18/34/50, ViT, MobileNetV2) and datasets (ImageNet, CIFAR-10/100, Flowers-102, Oxford-IIIT Pets) demonstrate:
| Method (Bit-width W/A) | Top-1 ImageNet (%) | Max Hessian eigenvalue $\lambda_{\max}$ |
|---|---|---|
| SGD (LSQ), 4/4 | 76.7 | 71.8 |
| C-SAM, 4/4 | 77.6 (+0.9) | 6.3 |
For ViT-B/16 (4-bit), Top-1 accuracy improved from 78.0% (LSQ+ AdamW) to 79.2% (C-SAM). For MobileNetV2 (ImageNet, 4/4), C-SAM outperformed PROFIT by 0.4%. Transfer learning from a 4-bit ResNet-50 showed consistent accuracy improvements of up to +1% Top-1 across new datasets (Liu et al., 2021).
5.2 Pruning-aware C-SAM
For unstructured pruning at 50% sparsity (ResNet-18):
| Method | CelebA-PCA (%) | Flowers-PCA (%) | CIFAR-PCA (%) |
|---|---|---|---|
| HYDRA | 26.0 | 63.0 | 53.0 |
| S²-SAM | 58.0 | 72.0 | 33.0 |
| C-SAM | 74.0 | 88.0 | 60.0 |
For structured pruning, C-SAM achieved 66.0% (CelebA-PCA), 83.0% (Flowers-PCA), and 73.0% (CIFAR-PCA), outperforming the DepGraph and HESSO baselines. Robustness gains of up to +42% PCA were observed over prior methods on GoogLeNet at a 50% pruning ratio. On some tasks, C-SAM-pruned models surpassed even unpruned vanilla models in certified robustness (He et al., 28 Jan 2026).
6. Ablations, Limitations, and Practical Recommendations
- Ablation experiments confirm that each loss component contributes to robustness (e.g., removing one of the components reduces CelebA PCA from 74% to 68%).
- Hyperparameter sensitivity is moderate: a single setting of the mask noise magnitude and safety margin works well across most tested scenarios, and moderate-percentile soft-mask initialization yields better mask diversity than random or extreme-percentile choices.
- C-SAM trains at only modest extra cost over standard pipelines: for quantization, the Look-style gradient reuse keeps the overhead small; for pruning, efficiency depends on the mask update schedule.
- “SAM-before-pruning” and “SAM-after-pruning” baselines fail to produce robust subnetworks because their pruning masks are either fixed or not adapted for stability; C-SAM, by contrast, integrates subnetwork discovery with sharpness-aware learning, yielding subnetworks that are simultaneously compact and certifiably robust.
7. Significance and Research Impact
Compression-aware Sharpness Minimization establishes a unified, algorithmically efficient framework for compression-robust training in deep networks. Distinctive features include: (1) recasting quantization/pruning as structured perturbations for sharpness objectives, (2) precise loss landscape regularization in relevant compressed subspaces, and (3) consistent improvements over prior SOTA in both generalization and certified robustness at high compression rates. C-SAM is of direct relevance to practitioners deploying DNNs under tight memory and inference constraints, particularly for on-device and edge computing scenarios (Liu et al., 2021, He et al., 28 Jan 2026).