Sharpness-Aware & Reliable Entropy Minimization (SAR)

Updated 5 May 2026

SAR is a training and adaptation paradigm that integrates sharpness-aware minimization with entropy-based corrections to enhance generalization and confidence calibration.
The method employs a min-max optimization approach to encourage convergence to flat minima while explicitly penalizing overconfident predictions through calibrated entropy regularization.
Test-time adaptation with SAR² and additional feature regularizers mitigates representation collapse and stabilizes performance under dynamic, shifted data distributions.

Sharpness-Aware and Reliable Entropy Minimization (SAR) refers to a set of training and adaptation paradigms that combine the generalization properties of sharpness-aware minimization (SAM) with explicit, primarily entropy-based mechanisms that control confidence calibration and prevent representation collapse. SAR arises in both standard supervised training and online test-time adaptation. Its core principle is to produce neural networks whose predictive probabilities are robust under distribution shifts and whose confidence estimates reliably track the true accuracy.

1. Theoretical Foundations and Formulation

Sharpness-Aware Minimization (SAM) is formulated as a min-max optimization problem that seeks parameter vectors $\theta$ which are robust to worst-case weight perturbations of magnitude $\rho$. In contrast to standard stochastic gradient descent (SGD), which minimizes the empirical risk,

$\min_{\theta} L_S(\theta) = \frac{1}{n} \sum_{i=1}^n \ell_\theta(z_i),$

SAM instead solves

$\min_{\theta} \max_{\|\epsilon\|_2 \leq \rho} L_S(\theta+\epsilon).$

At each step, the inner maximization is approximated by a first-order Taylor expansion, yielding

$\epsilon^* = \rho \frac{\nabla L_S(\theta)}{ \|\nabla L_S(\theta)\|_2 },$

and the update for step $k$ is

$\tilde\theta_k = \theta_k + \rho \frac{\nabla L_{\Omega_k}(\theta_k)}{ \|\nabla L_{\Omega_k}(\theta_k)\|_2 }, \qquad \theta_{k+1} = \theta_k - \eta \nabla L_{\Omega_k}(\tilde\theta_k).$

This mechanism encourages convergence to flat minima, improving generalization and robustness.
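
The two-pass update can be illustrated with a minimal sketch in plain Python. The quadratic toy loss, the starting point, and the values of `rho` and `eta` are assumptions chosen for illustration; they do not come from the cited papers.

```python
import math

def loss(theta):
    # Toy objective (assumption): an anisotropic 2-D quadratic bowl.
    x, y = theta
    return 0.5 * (x * x + 10.0 * y * y)

def grad(theta):
    # Analytic gradient of the toy loss.
    x, y = theta
    return [x, 10.0 * y]

def sam_step(theta, rho=0.05, eta=0.05):
    # Ascent step: move a distance rho along the normalized gradient.
    g = grad(theta)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    theta_tilde = [t + rho * gi / norm for t, gi in zip(theta, g)]
    # Descent step: apply the gradient taken at the perturbed point
    # to the original (unperturbed) parameters.
    g_tilde = grad(theta_tilde)
    return [t - eta * gi for t, gi in zip(theta, g_tilde)]

theta = [1.0, 1.0]
history = [loss(theta)]
for _ in range(20):
    theta = sam_step(theta)
    history.append(loss(theta))
```

Each step costs two gradient evaluations, one at $\theta_k$ and one at $\tilde\theta_k$, which is the main overhead of SAM relative to SGD.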

A key theoretical finding is that, for the standard cross-entropy loss, SAM not only discourages sharp minima but also implicitly regularizes the negative entropy of predictive distributions. Specifically, after the SAM ascent step, the model's confidence in the correct class is strictly reduced, which has the effect of "softening" predictions and thereby combats overconfidence—a prevalent defect in neural classification models (Tan et al., 29 May 2025).

2. Calibration, Implicit Entropy Regularization, and CSAM

Overconfident predictions and miscalibration are addressed in SAR by exploiting SAM's entropy-regularizing effect. For the cross-entropy loss $\ell_\theta(x, y) = -\log p_y(\theta)$, under mild conditions,

$\tilde p_y = [f_{\tilde\theta}(x)]_y \leq e^{-\rho/2} p_y,$

which forces lower confidence on the correct class after the weight perturbation.
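
The direction of this effect can be checked numerically. The sketch below assumes a toy linear softmax classifier with hand-picked weights, input, and `rho`. It only verifies that the correct-class probability drops after the ascent step; it does not verify the specific $e^{-\rho/2}$ factor.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, x):
    # Linear model (assumption): one weight row per class.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def ce_grad(W, x, y):
    # Gradient of -log p_y w.r.t. W for a linear softmax model:
    # (p - onehot(y)) outer x.
    p = softmax(logits(W, x))
    return [[(p[c] - (1.0 if c == y else 0.0)) * xi for xi in x]
            for c in range(len(W))]

def sam_perturb(W, G, rho):
    # theta_tilde = theta + rho * G / ||G||_2 (norm over all entries).
    norm = math.sqrt(sum(g * g for row in G for g in row)) + 1e-12
    return [[w + rho * g / norm for w, g in zip(wr, gr)]
            for wr, gr in zip(W, G)]

W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]  # illustrative weights
x, y, rho = [1.0, 2.0], 1, 0.5              # illustrative sample and radius

p_before = softmax(logits(W, x))[y]
W_tilde = sam_perturb(W, ce_grad(W, x, y), rho)
p_after = softmax(logits(W_tilde, x))[y]
```

For this convex toy model the cross-entropy loss necessarily increases along its own gradient, so `p_after` is smaller than `p_before`; for deep networks the statement relies on the conditions given in the cited analysis.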

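The entropy bound formalizes this effect: the perturbed loss is bounded below by the standard loss plus entropy terms, so minimizing the worst-case loss acts like minimizing the standard loss with an explicit negative-entropy penalty:

$\ell_{\tilde\theta}(x, y) \geq -\log p_y - \lambda H(p_y) + H(\tilde p_y),$

where $\lambda$ is a weighting coefficient and $H(\cdot)$ denotes the entropy term of the indicated class probability.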

The Calibrated SAM (CSAM) variant refines this by adapting the loss to further disincentivize overconfidence, especially for high-confidence predictions, which yields an amplified entropy regularization for overconfident predictions (Tan et al., 29 May 2025).

3. Test-Time Adaptation: SAR and SAR² Algorithms

Sharpness-Aware and Reliable Entropy Minimization (SAR) as a Test-Time Adaptation (TTA) method applies these principles to online domain shift. At test time, SAR combines entropy minimization, sharpness-aware robustification, and explicit sample filtering:

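The per-sample entropy loss,

$E(x;\theta) = -\sum_{c=1}^{C} p_c(x;\theta) \log p_c(x;\theta), \qquad p(x;\theta) = f_\theta(x),$

is minimized only on samples selected by a mask $S(x) = \mathbb{1}[E(x;\theta) < E_0]$, which filters out examples whose entropy is above a critical threshold $E_0$ (indicating unreliability or noisiness).

A sharpness-aware penalty,

$E^{\mathrm{SA}}(x;\theta) = \max_{\|\epsilon\|_2 \leq \rho} E(x;\theta+\epsilon),$

biases the parameter update toward flatter minima by evaluating gradients at a locally perturbed point.

The total batch loss is

$L(\theta) = \frac{1}{|\mathcal{B}|} \sum_{x \in \mathcal{B}} S(x) \left[ E(x;\theta) + \alpha E^{\mathrm{SA}}(x;\theta) \right],$

where $\alpha$ is a balancing hyperparameter.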
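
The three ingredients (entropy, reliability filter, sharpness-aware gradient) can be put together in a small sketch. It assumes a toy linear softmax model in place of a network, approximates the inner maximization with a single normalized ascent step as in SAM, and uses illustrative values for the threshold `e0`, radius `rho`, step size `eta`, and balance weight `alpha`. None of these names or values are taken from the authors' code.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(W, x):
    # Linear softmax model: logits z = W x (toy stand-in for a network).
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0.0)

def entropy_grad(W, x):
    # dE/dz_k = -p_k (log p_k + E), then chain rule through z = W x.
    p = predict(W, x)
    h = entropy(p)
    dz = [-pk * (math.log(pk) + h) for pk in p]
    return [[d * xi for xi in x] for d in dz]

def mean_grad(W, samples):
    G = [[0.0] * len(W[0]) for _ in W]
    for x in samples:
        for c, row in enumerate(entropy_grad(W, x)):
            for d, v in enumerate(row):
                G[c][d] += v / len(samples)
    return G

def sar_step(W, batch, e0, rho=0.05, eta=0.1, alpha=1.0):
    # Reliability filter: keep only samples with entropy below e0.
    reliable = [x for x in batch if entropy(predict(W, x)) < e0]
    if not reliable:
        return W, 0
    g = mean_grad(W, reliable)
    norm = math.sqrt(sum(v * v for row in g for v in row)) + 1e-12
    # Ascent step to the locally perturbed point.
    W_tilde = [[w + rho * v / norm for w, v in zip(wr, gr)]
               for wr, gr in zip(W, g)]
    g_sa = mean_grad(W_tilde, reliable)
    # Descent on entropy plus alpha times the sharpness-aware term.
    W_new = [[w - eta * (a + alpha * b) for w, a, b in zip(wr, ar, br)]
             for wr, ar, br in zip(W, g, g_sa)]
    return W_new, len(reliable)

W0 = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
batch = [[2.0, 0.0], [0.1, 0.1]]   # one confident, one ambiguous sample
e0 = 0.4 * math.log(3)             # illustrative threshold for 3 classes
entropy_before = entropy(predict(W0, batch[0]))
W1, n_used = sar_step(W0, batch, e0)
entropy_after = entropy(predict(W1, batch[0]))
```

In this toy batch only the confident sample passes the filter, so the ambiguous one contributes no gradient; that is the intended effect of the mask on noisy test inputs.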

The SAR² algorithm further prevents representational collapse in wild test streams with two additional centroid-based feature-space regularizers:

Redundancy regularizer: minimizes correlations among class prototypes (computed via feature centroids).
Inequity regularizer: encourages balanced class assignments by maximizing the entropy of the global centroid prediction.

The sharpness-aware analogs of these regularizers are constructed as maxima over local perturbations, analogous to entropy sharpness.

4. Calibration and Feature Regularization Metrics

SAR's effectiveness is assessed with calibration metrics that quantify the match between predicted confidences and empirical correctness. The principal metrics are:

Expected Calibration Error (ECE): Test examples are binned by confidence; for each bin $B_m$, the absolute difference between prediction accuracy and average maximum softmax probability is weighted by the bin's share of the $N$ test examples and summed:

$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|.$

Adaptive ECE (AdaECE): Uses adaptive bin boundaries so that each bin holds an equal number of samples, giving a finer evaluation.
Temperature-Scaled Calibration Error (TCE): ECE measured after temperature scaling.
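
A minimal sketch of the standard equal-width ECE computation follows. The function name, the default of 10 bins, and the convention that a confidence on a bin boundary goes to the upper bin are assumptions, and the four-sample input is made up for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width ECE: weighted mean of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Map confidence in [0, 1] to a bin index; 1.0 goes to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1.0 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

# Two samples at 0.95 confidence (both correct) and two at 0.55 (one correct).
ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55],
                                 [True, True, True, False])
```

Here each occupied bin has a gap of 0.05 and holds half the samples, so the result is 0.05, i.e. an ECE of 5%.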

Feature space regularization is evaluated by:

Redundancy (mean off-diagonal squared correlation in feature covariance) and
Inequity (entropy deviation from uniform for pooled centroids).
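
Both quantities can be sketched in a few lines. The version below is a simplification under stated assumptions: redundancy is computed over raw batch features rather than class centroids, and inequity uses the batch-mean prediction as a stand-in for the pooled centroid prediction. The function names are hypothetical.

```python
import math

def redundancy(features):
    """Mean squared off-diagonal correlation between feature dimensions."""
    n, d = len(features), len(features[0])
    means = [sum(f[j] for f in features) / n for j in range(d)]
    centered = [[f[j] - means[j] for j in range(d)] for f in features]
    std = [math.sqrt(sum(r[j] ** 2 for r in centered) / n) + 1e-12
           for j in range(d)]
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum(r[i] * r[j] for r in centered) / n
                total += (cov / (std[i] * std[j])) ** 2
    return total / (d * (d - 1))

def inequity(probs):
    """log(C) minus the entropy of the mean prediction; 0 when balanced."""
    n, c = len(probs), len(probs[0])
    mean_p = [sum(p[k] for p in probs) / n for k in range(c)]
    h = -sum(v * math.log(v) for v in mean_p if v > 0.0)
    return math.log(c) - h

r_collapsed = redundancy([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
r_diverse = redundancy([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
q_balanced = inequity([[1.0, 0.0], [0.0, 1.0]])
q_collapsed = inequity([[1.0, 0.0], [1.0, 0.0]])
```

Perfectly correlated features give a redundancy near 1 and decorrelated ones give 0, while a stream assigned entirely to one class gives the maximum inequity of $\log C$.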

5. Empirical Evaluation

Empirical results on canonical benchmarks demonstrate the effect and stability of SAR:

| Approach | Dataset | ECE (%) | Additional Notes |
|---|---|---|---|
| SGD | CIFAR-100 | ≈3.95 | Baseline |
| SAM | CIFAR-100 | ≈2.11 | ~50–70% reduction over SGD |
| SAM | ImageNet-1K (ViT) | 9.72 → 1.76 | ECE reduction, Top-1 acc gain 0.3–4% |
| CSAM | CIFAR-100 | ≈1.93 | Outperforms SAM, maintains accuracy |
| SAR² (TTA) | ImageNet-C, CIFAR-C | variance ↓, acc ↑ | Stable under wild distribution shifts |

On CIFAR-10 with WideResNet-28-10, CSAM achieves ECE ≈ 0.50%, versus 0.86% for SAM and ≈1.71% for SGD with temperature scaling. Under test-time adaptation on heavily shifted or imbalanced batches, SAR² yields orders-of-magnitude lower variance and mitigates catastrophic representation collapse compared to prior TTA schemes (Tan et al., 29 May 2025, Niu et al., 5 Sep 2025).

6. Implementation and Deployment Considerations

Several best practices ensure the robust deployment of SAR and SAR²:

Batch-agnostic normalization (GroupNorm/LayerNorm) is essential for stability under small, mixed, or imbalanced batches; BatchNorm often introduces instability in TTA (Niu et al., 5 Sep 2025).
Stable defaults: a fixed entropy filter threshold $E_0$ and a fixed sharpness radius $\rho$, with redundancy and inequity weights scaled to feature dimension and class count, respectively.
Online monitoring of the moving-average entropy enables model recovery by resetting parameters if collapse is detected.
Hyperparameter choices generalize across benchmarks; tuning can be performed coarsely.
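
The monitoring-and-reset practice can be sketched as a small helper. The class name `CollapseMonitor`, the reset threshold, and the momentum are hypothetical illustration values, and the rule "reset when the moving-average entropy falls below a threshold" is one plausible reading of collapse detection, not a confirmed detail of the cited method.

```python
import copy

class CollapseMonitor:
    """Tracks a moving average of batch entropy and restores a snapshot."""

    def __init__(self, initial_params, reset_threshold=0.2, momentum=0.9):
        self._snapshot = copy.deepcopy(initial_params)
        self.reset_threshold = reset_threshold
        self.momentum = momentum
        self.ema = None

    def update(self, params, batch_entropy):
        # Exponential moving average of the per-batch mean entropy.
        if self.ema is None:
            self.ema = batch_entropy
        else:
            self.ema = (self.momentum * self.ema
                        + (1.0 - self.momentum) * batch_entropy)
        # Abnormally low average entropy is treated as a collapse signal.
        if self.ema < self.reset_threshold:
            self.ema = None
            return copy.deepcopy(self._snapshot), True
        return params, False

source_params = {"w": [1.0, 2.0]}
monitor = CollapseMonitor(source_params, reset_threshold=0.2, momentum=0.5)
params = {"w": [9.0, 9.0]}          # pretend these drifted during adaptation
resets = []
for h in [1.0, 0.1, 0.1, 0.1, 0.1]:
    params, was_reset = monitor.update(params, h)
    resets.append(was_reset)
```

With these illustrative numbers the moving average crosses the threshold on the fifth batch, at which point the adapted parameters are replaced by a copy of the stored source parameters.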

7. Open Questions and Directions

Research directions include extending SAR analysis beyond cross-entropy objectives, clarifying the temporal emergence of calibration under SAM (early versus late training), and developing single-step or computationally efficient approximations preserving both generalization and calibration. Another unresolved topic is the scalable and robust deployment of SAR-based adaptation in safety-critical or real-time settings (Tan et al., 29 May 2025, Niu et al., 5 Sep 2025).


References

Tan et al. (2025). Towards Understanding the Calibration Benefits of Sharpness-Aware Minimization.

Niu et al. (2025). Adapt in the Wild: Test-Time Entropy Minimization with Sharpness and Feature Regularization.