Sharpness-Aware & Reliable Entropy Minimization (SAR)
- SAR is a training and adaptation paradigm that integrates sharpness-aware minimization with entropy-based corrections to enhance generalization and confidence calibration.
- The method employs a min-max optimization approach to encourage convergence to flat minima while explicitly penalizing overconfident predictions through calibrated entropy regularization.
- Test-time adaptation with SAR² and additional feature regularizers mitigates representation collapse and stabilizes performance under dynamic, shifted data distributions.
Sharpness-Aware and Reliable Entropy Minimization (SAR) refers to a set of training and adaptation paradigms that combine the generalization properties of sharpness-aware minimization (SAM) with explicit mechanisms to control confidence calibration and representation collapse, primarily through entropy-based corrections. SAR arises in both standard supervised training and online test-time adaptation. Its core principle is to produce neural networks whose predictive probabilities are robust under distribution shifts and whose confidence estimates reliably track the true accuracy.
1. Theoretical Foundations and Formulation
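Sharpness-Aware Minimization (SAM) is formulated as a min-max optimization problem that seeks parameter vectors $w$ which are robust to worst-case weight perturbations of magnitude $\rho$. In contrast to standard stochastic gradient descent (SGD), which minimizes the empirical risk,
$$\min_{w} L(w),$$
SAM instead solves
$$\min_{w} \max_{\|\epsilon\|_2 \le \rho} L(w + \epsilon).$$
At each step, the inner maximization is approximated by a first-order Taylor expansion, yielding
$$\hat{\epsilon}(w) = \rho \, \frac{\nabla_w L(w)}{\|\nabla_w L(w)\|_2},$$
and the update for step $t$ with learning rate $\eta$ is
$$w_{t+1} = w_t - \eta \, \nabla_w L(w) \big|_{w = w_t + \hat{\epsilon}(w_t)}.$$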
This mechanism encourages convergence to flat minima, improving generalization and robustness.
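The ascent-then-descent update can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the toy quadratic `loss`, its hand-written `grad`, and the values `lr=0.1` and `rho=0.05` are assumptions chosen only for readability.

```python
import math

def loss(w):
    # Toy quadratic loss (an illustrative assumption, not from the source).
    return 0.5 * w[0] ** 2 + 2.0 * w[1] ** 2

def grad(w):
    # Analytic gradient of the toy loss above.
    return [w[0], 4.0 * w[1]]

def sam_step(w, lr=0.1, rho=0.05):
    # Ascent: first-order worst-case perturbation of L2 norm rho.
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    eps = [rho * gi / norm for gi in g]
    # Descent: apply the gradient taken at the perturbed point to the original weights.
    g_adv = grad([wi + ei for wi, ei in zip(w, eps)])
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

w = [1.0, 1.0]
for _ in range(50):
    w = sam_step(w)
```

Each step costs two gradient evaluations, one at the current weights and one at the perturbed weights, which is why SAM is roughly twice as expensive per step as plain SGD.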
A key theoretical finding is that, for the standard cross-entropy loss, SAM not only discourages sharp minima but also implicitly regularizes the negative entropy of predictive distributions. Specifically, after the SAM ascent step, the model's confidence in the correct class is strictly reduced, which has the effect of "softening" predictions and thereby combats overconfidence, a prevalent defect in neural classification models (Tan et al., 29 May 2025).
2. Calibration, Implicit Entropy Regularization, and CSAM
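Overconfident predictions and miscalibration are addressed in SAR by exploiting SAM's entropy-regularizing effect. For the cross-entropy loss $\ell(w; x, y) = -\log p_w(y \mid x)$, under mild conditions,
$$p_{w + \hat{\epsilon}}(y \mid x) < p_w(y \mid x),$$
where $\hat{\epsilon}$ is the SAM ascent perturbation. This forces lower confidence on the correct class after the weight perturbation.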
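A small numeric check makes the confidence drop concrete. The sketch below assumes a linear softmax classifier on a single example; the weights `W`, the input `x`, the label `y`, and `rho=0.5` are made-up values. For this model the per-sample cross-entropy is convex in the weights, so a normalized gradient-ascent step cannot raise the correct-class probability.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def probs(W, x):
    # Linear classifier: one weight row per class.
    return softmax([sum(wj * xj for wj, xj in zip(row, x)) for row in W])

def ce_grad(W, x, y):
    # Gradient of -log p_y with respect to W: (p_c - 1[c == y]) * x_j.
    p = probs(W, x)
    return [[(p[c] - (1.0 if c == y else 0.0)) * xj for xj in x]
            for c in range(len(W))]

def ascent(W, x, y, rho=0.5):
    # Normalized ascent step of L2 norm rho, as in the SAM inner maximization.
    g = ce_grad(W, x, y)
    norm = math.sqrt(sum(v * v for row in g for v in row)) + 1e-12
    return [[w + rho * v / norm for w, v in zip(wr, gr)]
            for wr, gr in zip(W, g)]

W = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]  # hypothetical weights
x, y = [1.0, 0.5], 0                       # hypothetical example and label
p_before = probs(W, x)[y]
p_after = probs(ascent(W, x, y), x)[y]
```

Here `p_after` is smaller than `p_before`: the perturbed model is less confident in the correct class, and the SAM descent gradient is computed against that softened prediction.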
The entropy bound formalizes this effect: minimizing the worst-case perturbed loss is equivalent to minimizing the standard loss plus an explicit negative-entropy penalty on the predictive distribution.
The Calibrated SAM (CSAM) variant refines this by adapting the loss so that the entropy regularization is amplified for high-confidence predictions, further disincentivizing overconfidence (Tan et al., 29 May 2025).
3. Test-Time Adaptation: SAR and SAR² Algorithms
Sharpness-Aware and Reliable Entropy Minimization (SAR) as a Test-Time Adaptation (TTA) method applies these principles to online domain shift. At test time, SAR combines entropy minimization, sharpness-aware robustification, and explicit sample filtering:
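- The per-sample entropy loss,
$$E(x; \theta) = -\sum_{c} p_\theta(c \mid x) \log p_\theta(c \mid x),$$
is minimized only on samples retained by a mask $S(x) = \mathbb{1}[E(x; \theta) < E_0]$, which filters out examples whose entropy is above a critical threshold $E_0$ (indicating unreliability or noisiness).
- A sharpness-aware penalty,
$$E^{\mathrm{SA}}(x; \theta) = \max_{\|\epsilon\|_2 \le \rho} E(x; \theta + \epsilon),$$
biases the parameter update toward flatter minima by evaluating gradients at a locally perturbed point.
- The total batch loss is
$$\mathcal{L}(\theta) = \sum_{x \in \mathcal{B}} S(x) \left[ E(x; \theta) + \lambda \, E^{\mathrm{SA}}(x; \theta) \right],$$
where $\lambda$ is a balancing hyperparameter.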
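The reliability filter can be sketched without a model by working directly on softmax outputs. This is a simplified illustration: the helper name `reliable_entropy_loss`, the example probabilities, the threshold `e0 = 0.8 * ln(3)`, and the choice to average over retained samples are all assumptions, and the sharpness-aware term is omitted.

```python
import math

def entropy(p):
    # Shannon entropy (in nats) of one predictive distribution.
    return -sum(q * math.log(q) for q in p if q > 0.0)

def reliable_entropy_loss(batch_probs, e0):
    # Keep only samples whose entropy is below the threshold e0.
    ents = [entropy(p) for p in batch_probs]
    mask = [1.0 if e < e0 else 0.0 for e in ents]
    kept = sum(mask)
    # Average over retained samples (an assumed normalization).
    loss = sum(m * e for m, e in zip(mask, ents)) / max(kept, 1.0)
    return loss, mask

batch = [
    [0.90, 0.05, 0.05],  # confident: retained
    [0.34, 0.33, 0.33],  # near-uniform: filtered out
    [0.70, 0.20, 0.10],  # moderately confident: retained
]
e0 = 0.8 * math.log(3)   # hypothetical threshold, as a fraction of the maximum entropy
loss, mask = reliable_entropy_loss(batch, e0)
```

In a real adaptation loop the retained entropies would be backpropagated through the model, so the filtered samples contribute no gradient at all.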
The SAR² algorithm further prevents representational collapse in wild test streams with two additional centroid-based feature-space regularizers:
- A redundancy regularizer minimizes correlations among class prototypes (computed via feature centroids).
- An inequity regularizer encourages balanced class assignments by maximizing the entropy of the global centroid prediction.
The sharpness-aware analogs of these regularizers are constructed as maxima over local perturbations, analogous to entropy sharpness.
4. Calibration and Feature Regularization Metrics
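SAR's effectiveness is assessed with calibration metrics that quantify the match between predicted confidences and empirical correctness. The principal metrics are:
- Expected Calibration Error (ECE): Test examples are binned by confidence; for each bin $B_m$, the gap between prediction accuracy and average maximum softmax probability is weighted by the bin's share of the $n$ test examples and summed over the $M$ bins:
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|.$$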
- Adaptive ECE (AdaECE): Chooses bin boundaries adaptively so that each bin holds the same number of examples, giving a finer evaluation where confidences cluster.
- Temperature-Scaled Calibration Error (TCE): ECE measured after temperature scaling.
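A minimal ECE computation with equal-width bins looks as follows. The function name, the default of 10 bins, and the four-example input are illustrative assumptions; published evaluations may differ in bin count and in how bin edges are handled.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    # confidences: maximum softmax probability per example, in [0, 1].
    # correct: whether each top-1 prediction was right.
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # equal-width bins; 1.0 goes in the last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(c for c, _ in b) / len(b)             # mean confidence in the bin
        acc = sum(1.0 for _, ok in b if ok) / len(b)     # accuracy in the bin
        ece += len(b) / n * abs(acc - conf)              # weight by bin share
    return ece

ece = expected_calibration_error(
    [0.95, 0.95, 0.55, 0.55],
    [True, True, True, False],
)
```

In this example both occupied bins are off by 0.05 and each holds half the data, so the result is approximately 0.05, i.e. an ECE of about 5%.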
Feature space regularization is evaluated by:
- Redundancy (mean off-diagonal squared correlation in feature covariance) and
- Inequity (entropy deviation from uniform for pooled centroids).
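The two feature-space diagnostics can be sketched as below. This is one plausible reading of the descriptions above, not the authors' exact definitions: `redundancy` takes a list of feature vectors and averages the squared off-diagonal entries of their correlation matrix, and `inequity` takes the prediction at the pooled centroid and returns its entropy gap to the uniform distribution. Both function signatures are assumptions.

```python
import math

def redundancy(features):
    # features: n rows (samples or centroids), each with d >= 2 dimensions.
    n, d = len(features), len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in features]
    std = [math.sqrt(sum(r[j] ** 2 for r in centered) / n) + 1e-12
           for j in range(d)]
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                corr = sum(r[i] * r[j] for r in centered) / (n * std[i] * std[j])
                total += corr ** 2
    return total / (d * (d - 1))  # mean over off-diagonal entries

def inequity(pooled_probs):
    # Entropy gap to uniform: 0 when classes are perfectly balanced.
    h = -sum(p * math.log(p) for p in pooled_probs if p > 0.0)
    return math.log(len(pooled_probs)) - h

r_high = redundancy([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])               # perfectly correlated dims
r_low = redundancy([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])  # uncorrelated dims
i_balanced = inequity([0.5, 0.5])
i_collapsed = inequity([1.0, 0.0])
```

Both quantities are zero in the ideal case (decorrelated features, balanced class assignments) and grow as the representation collapses toward a few directions or a single class.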
5. Empirical Evaluation
Empirical results on canonical benchmarks demonstrate the effect and stability of SAR:
| Approach | Dataset | ECE (%) | Additional Notes |
|---|---|---|---|
| SGD | CIFAR-100 | ≈3.95 | Baseline |
| SAM | CIFAR-100 | ≈2.11 | ~50–70% reduction over SGD |
| SAM | ImageNet-1K (ViT) | 9.72 → 1.76 | ECE reduction, Top-1 acc gain 0.3–4% |
| CSAM | CIFAR-100 | ≈1.93 | Outperforms SAM, maintains accuracy |
| SAR² (TTA) | ImageNet-C, CIFAR-C | variance ↓, acc ↑ | Stable under wild distribution shifts |
On CIFAR-10 with WideResNet-28-10, CSAM achieves ECE ≈ 0.50% versus SAM's 0.86% and ≈1.71% for SGD with temperature scaling. Under test-time adaptation on heavily shifted or imbalanced batches, SAR² yields orders-of-magnitude lower variance and mitigates catastrophic representation collapse compared to prior TTA schemes (Tan et al., 29 May 2025, Niu et al., 5 Sep 2025).
6. Implementation and Deployment Considerations
Several best practices ensure the robust deployment of SAR and SAR²:
- Batch-agnostic normalization (GroupNorm/LayerNorm) is essential for stability under small, mixed, or imbalanced batches; BatchNorm often introduces instability in TTA (Niu et al., 5 Sep 2025).
- Stable defaults: fixed values are used for the entropy filter threshold and the sharpness radius, while the redundancy and inequity weights are scaled to feature dimension and class count, respectively.
- Online monitoring of the moving-average entropy enables model recovery by resetting parameters if collapse is detected.
- Hyperparameter choices generalize across benchmarks; tuning can be performed coarsely.
7. Open Questions and Directions
Research directions include extending SAR analysis beyond cross-entropy objectives, clarifying the temporal emergence of calibration under SAM (early versus late training), and developing single-step or computationally efficient approximations preserving both generalization and calibration. Another unresolved topic is the scalable and robust deployment of SAR-based adaptation in safety-critical or real-time settings (Tan et al., 29 May 2025, Niu et al., 5 Sep 2025).