
SICL: Style Invariance as Correctness Likelihood

Updated 15 December 2025
  • The paper introduces SICL, a post-hoc uncertainty calibration method that estimates instance-wise prediction correctness via style perturbations.
  • It leverages controlled, content-preserving transformations to measure consistency across style-altered predictions using a simple scoring mechanism.
  • SICL achieves significant reductions in Expected Calibration Error (ECE) and integrates seamlessly with any test-time adaptation framework without backpropagation.

Style Invariance as a Correctness Likelihood (SICL) is a post-hoc uncertainty calibration framework for test-time adaptation (TTA) of deep learning models. SICL leverages the principle that model predictions should be invariant to style alterations (perturbations orthogonal to task-relevant content) while remaining sensitive to content changes, and uses this to estimate the instance-wise likelihood of prediction correctness. This plug-and-play, backpropagation-free module can be used with any TTA method, addressing the predictive miscalibration that arises when models are continually adapted on non-stationary, unlabeled test distributions (Nam et al., 8 Dec 2025).

1. Theoretical Rationale

TTA procedures adapt deployed models on-the-fly to out-of-distribution data predominantly via entropy minimization or similar objectives, but this often results in overconfident and miscalibrated outputs, especially under distributional shift. Conventional calibration techniques, including temperature scaling and MC Dropout, generally presuppose frozen model parameters or static data distributions, and are therefore inadequate for calibration under dynamic test-time adaptation.

SICL is conceptually grounded in the “style–content” disentanglement established in representation learning (e.g., Gatys et al., Huang & Belongie, von Kügelgen et al.). Let $x$ denote input data, generated by two independent factors: $c$ (content, task-relevant) and $s$ (style, nuisance). In this causal model:

  • An ideal classifier’s decision is exclusively a function of $c$.
  • Style-perturbing $x$ while preserving $c$ should not affect the model’s prediction.

From this, prediction consistency under controlled, content-preserving style transformations can serve as a proxy for a model’s correctness: high consistency across style-altered versions of $x$ correlates with genuine confidence, while discordance signals unreliability, especially for samples near decision boundaries or under covariate shift.

2. Mathematical Formulation

Given a model $f$ (potentially updated per TTA) with softmax output $p(x) \in \Delta^{K-1}$ for $K$ classes, SICL assesses correctness likelihood as follows:

  • Generate $N$ style-perturbed variants $\{\tau_j(x)\}_{j=1}^N$ of $x$ via feature-level perturbations (see Section 3).
  • Let $p_0 = p(x)$ and $p_j = p(\tau_j(x))$ for $j = 1, \ldots, N$, with corresponding predictions $\hat y_0 = \arg\max_k p_0[k]$ and $\hat y_j = \arg\max_k p_j[k]$.

The vanilla style-invariance score is

$$\gamma^{\rm style}(x) = \frac{1}{N} \sum_{j=1}^N \mathbf{1}\{\hat y_j = \hat y_0\}$$

Alternative consistency metrics include $L_{\ell_1}(x) = 1 - \frac{1}{N} \sum_{j=1}^N \|p_0 - p_j\|_1$ or a KL-based score, but $\gamma^{\rm style}$ is preferred in practice.
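The two scores above can be illustrated with a minimal sketch in plain Python. The function names (`gamma_style`, `l1_consistency`) and the toy probability vectors are illustrative assumptions, not taken from the paper; the softmax outputs are passed in directly rather than produced by a model.

```python
from typing import List, Sequence


def argmax(p: Sequence[float]) -> int:
    """Index of the largest probability (first index wins on ties)."""
    return max(range(len(p)), key=lambda k: p[k])


def gamma_style(p0: Sequence[float], variants: List[Sequence[float]]) -> float:
    """Fraction of style-perturbed outputs whose argmax matches that of p0."""
    y0 = argmax(p0)
    return sum(argmax(pj) == y0 for pj in variants) / len(variants)


def l1_consistency(p0: Sequence[float], variants: List[Sequence[float]]) -> float:
    """One minus the mean L1 distance between p0 and each perturbed output."""
    dists = [sum(abs(a - b) for a, b in zip(p0, pj)) for pj in variants]
    return 1.0 - sum(dists) / len(dists)


# Toy softmax outputs (hypothetical values, K = 3 classes, N = 4 variants).
p0 = [0.7, 0.2, 0.1]
variants = [
    [0.6, 0.3, 0.1],  # still predicts class 0
    [0.5, 0.4, 0.1],  # still predicts class 0
    [0.3, 0.6, 0.1],  # flips to class 1
    [0.8, 0.1, 0.1],  # still predicts class 0
]

print(gamma_style(p0, variants))     # 0.75: three of four variants agree
print(l1_consistency(p0, variants))  # roughly 0.6
```

Note that, as written, the $\ell_1$ score lies in $[-1, 1]$ rather than $[0, 1]$, because the $\ell_1$ distance between two probability vectors can be as large as 2.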

To mitigate spurious agreement due to style or feature collapse during adaptation, SICL introduces a content-relaxation:

  • Generate $N$ content-perturbed variants $\{\zeta_j(x)\}$ (whiten, add noise, reproject).
  • Compute

$$\gamma^{\rm content}(x) = \frac{1}{N} \sum_{j=1}^N \mathbf{1}\{\hat y(\zeta_j(x)) = \hat y_0\}$$

  • The relaxation weight is $\omega_{\rm relax}(x) = 1 - \gamma^{\rm content}(x)$.
  • The final correctness likelihood is

$$L(x) = \omega_{\rm relax}(x) \cdot \gamma^{\rm style}(x)$$

yielding $L(x) \in [0,1]$ as a calibrated, instance-wise confidence.
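To see how the relaxation weight penalizes collapsed predictions, the following sketch evaluates $L(x)$ on three hand-made cases. The helper names and the label lists are hypothetical; only the formulas for $\gamma^{\rm style}$, $\gamma^{\rm content}$, $\omega_{\rm relax}$, and $L(x)$ come from the text above.

```python
from typing import List


def agreement(y0: int, labels: List[int]) -> float:
    """Fraction of perturbed predictions equal to the original prediction y0."""
    return sum(y == y0 for y in labels) / len(labels)


def correctness_likelihood(y0: int, style_labels: List[int],
                           content_labels: List[int]) -> float:
    """L(x) = (1 - gamma_content) * gamma_style."""
    g_style = agreement(y0, style_labels)
    g_content = agreement(y0, content_labels)
    omega_relax = 1.0 - g_content
    return omega_relax * g_style


# Stable under style changes, responsive to content changes.
healthy = correctness_likelihood(2, [2, 2, 2, 2], [0, 1, 3, 0])
# Predicts class 2 no matter what is fed in (a collapsed model).
collapsed = correctness_likelihood(2, [2, 2, 2, 2], [2, 2, 2, 2])
# Prediction flips under style changes alone.
fragile = correctness_likelihood(2, [2, 0, 1, 0], [0, 1, 3, 0])

print(healthy, collapsed, fragile)  # 1.0 0.0 0.25
```

The collapsed case shows why style agreement alone is not enough: its $\gamma^{\rm style}$ is 1, yet the relaxation weight drives $L(x)$ to 0.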

3. SICL Algorithmic Workflow

For each sample $x$ after any TTA update, SICL proceeds as follows:

  1. Forward $x$ through $f$ to obtain $p_0$, $\hat y_0$.
  2. Extract early feature tensor $z$ (e.g., first ResNet block or ViT patch embedding).
  3. Compute channel-wise mean $\mu$ and standard deviation $\sigma$ of $z$.
  4. For $j = 1, \ldots, N$, sample Gaussian noise $(\epsilon_\mu, \epsilon_\sigma)$, compute style-perturbed statistics $\mu' = \mu + \delta\epsilon_\mu$, $\sigma' = \sigma + \delta\epsilon_\sigma$, and reconstruct the feature $z' = \sigma'((z-\mu)/\sigma) + \mu'$. Reconstruct $x_j'$ and obtain $p_j$, $\hat y_j$.
  5. Compute $\gamma^{\rm style}$ as the fraction of $\hat y_j$ matching $\hat y_0$.
  6. (Optional) For $j = 1, \ldots, N$, generate content-perturbed features by whitening $z$, adding noise, re-projecting, and evaluating model outputs to compute $\gamma^{\rm content}$.
  7. Set $\omega_{\rm relax} = 1 - \gamma^{\rm content}$ and $L(x) = \omega_{\rm relax} \cdot \gamma^{\rm style}$.

SICL operates without gradient updates; the computational overhead is $N$ additional forward passes per instance (typically $N = 5$–$20$).
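Steps 3 and 4 can be sketched without any deep learning framework. The example below is a simplified illustration, not the authors' implementation: a feature map is represented as a list of channels, each a flat list of spatial activations; `channel_stats` and `style_perturb` are hypothetical names; the small `eps` for numerical stability is an assumption; and the rest of the network (which would turn $z'$ into $p_j$) is omitted.

```python
import math
import random
from typing import List, Tuple

Feature = List[List[float]]  # C channels, each a flat list of H*W activations


def channel_stats(z: Feature, eps: float = 1e-5) -> Tuple[List[float], List[float]]:
    """Per-channel mean and standard deviation (eps is an assumed stabilizer)."""
    mu = [sum(ch) / len(ch) for ch in z]
    sigma = [math.sqrt(sum((v - m) ** 2 for v in ch) / len(ch) + eps)
             for ch, m in zip(z, mu)]
    return mu, sigma


def style_perturb(z: Feature, delta: float, rng: random.Random) -> Feature:
    """Return z' = sigma' * (z - mu) / sigma + mu' with Gaussian-jittered statistics."""
    mu, sigma = channel_stats(z)
    out = []
    for ch, m, s in zip(z, mu, sigma):
        m_new = m + delta * rng.gauss(0.0, 1.0)  # mu' = mu + delta * eps_mu
        s_new = s + delta * rng.gauss(0.0, 1.0)  # sigma' = sigma + delta * eps_sigma
        out.append([s_new * (v - m) / s + m_new for v in ch])
    return out


rng = random.Random(0)
z = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.5, -0.5, 2.0]]  # toy 2-channel feature map
variants = [style_perturb(z, delta=0.1, rng=rng) for _ in range(5)]  # N = 5
print(variants[0])
```

Only the channel statistics change; the normalized activations $(z-\mu)/\sigma$ are carried over unchanged, which is what makes the perturbation content-preserving. The text does not say whether $\sigma'$ is clamped to stay positive, so this sketch leaves it unclamped.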

4. Integration and Applicability

SICL is architected for plug-and-play deployment:

  • Interface: $L(x)$ is computed per sample after each test-time adaptation step, replacing or modulating softmax confidence. TTA algorithms (e.g., entropy minimization, BatchNorm adaptation) are unaffected.
  • Computation: $O(N)$ per-sample forward passes; for batch size $B$ this is $O(BN)$, with full parallelizability across $N$.
  • Hyperparameters:
    • $N$: number of style/content variants (default 20).
    • $\delta$: scaling of feature statistics noise (empirical std of $\mu$ in batch).
    • Feature extraction layer: shallow layers are preferred for style capture.

No additional parameters are learned. SICL is agnostic to backbone architecture and can be applied to any TTA procedure.
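The plug-and-play interface can be pictured as a wrapper that treats the adapted model as a black box. The sketch below is an assumption about how such a wrapper might look: `sicl_scores`, `predict`, `style_variant`, and `content_variant` are hypothetical names, and the toy "model" is a hand-written threshold rule rather than a network.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, float]  # toy input: (content value, style value)


def sicl_scores(
    batch: Sequence[Sample],
    predict: Callable[[Sample], int],
    style_variant: Callable[[Sample, int], Sample],
    content_variant: Callable[[Sample, int], Sample],
    n: int = 20,
) -> List[float]:
    """Per-sample L(x), computed with forward passes only (no gradients)."""
    scores = []
    for x in batch:
        y0 = predict(x)
        g_style = sum(predict(style_variant(x, j)) == y0 for j in range(n)) / n
        g_content = sum(predict(content_variant(x, j)) == y0 for j in range(n)) / n
        scores.append((1.0 - g_content) * g_style)
    return scores


# Toy stand-ins (all hypothetical): the label depends mostly on content.
def predict(x: Sample) -> int:
    c, s = x
    return int(c + 0.1 * s > 0)


def style_variant(x: Sample, j: int) -> Sample:
    return (x[0], x[1] + (-1) ** j * 1.0)  # shift style only


def content_variant(x: Sample, j: int) -> Sample:
    return (x[0] + (-1) ** j * 1.0, x[1])  # shift content only


batch = [(0.5, 0.0), (0.05, 0.0)]  # far from / close to the decision boundary
print(sicl_scores(batch, predict, style_variant, content_variant, n=4))
# [0.5, 0.25]
```

Because the wrapper only calls `predict`, it does not care which TTA method updated the model beforehand. The sample close to the decision boundary receives the lower score, since style shifts alone are enough to flip its label.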

5. Empirical Evaluation

SICL was evaluated on CIFAR-10-C, CIFAR-100-C, and ImageNet-C (severity 5), using backbones ResNet-50, ResNet-101, and ViT (Small/Base). TTA methods included TENT, EATA, SAR, RoTTA, and SoTTA (and their LN-only ViT variants). Comparator calibration methods were Vanilla Temperature Scaling (TS), TransCal, PseudoCal, and MC Dropout.

The metric was cumulative Expected Calibration Error (ECE, %), evaluated on two data regimes: Benign (i.i.d. single corruption) and Dynamic (Dirichlet-mixed, time-varying corruptions).
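For reference, the standard binned ECE can be sketched as follows. This uses the common equal-width binning definition; the number of bins, the bin-edge convention, and how the paper accumulates ECE over the stream are not stated here, so those details are assumptions of this sketch.

```python
from typing import List, Sequence


def expected_calibration_error(conf: Sequence[float], correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Binned ECE in percent: sum over bins of (|B|/n) * |accuracy - mean confidence|."""
    n = len(conf)
    bins: List[list] = [[] for _ in range(n_bins)]
    for c, ok in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += len(b) / n * abs(accuracy - mean_conf)
    return 100.0 * ece


# Overconfident toy model: claims 95% confidence but is right half the time.
print(expected_calibration_error([0.95] * 4, [True, True, False, False]))  # about 45
# Calibrated toy model: claims 50% confidence and is right half the time.
print(expected_calibration_error([0.5, 0.5], [True, False]))  # 0.0
```

In the SICL setting, `conf` would hold $L(x)$ (or whichever confidence is being evaluated) and `correct` would record whether $\hat y_0$ matched the ground-truth label.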

Summary (averaged over 72 conditions):

  • MC Dropout: ECE ≈ 15–20% (CIFAR-10-C)
  • PseudoCal: ECE ≈ 6–8% (best static TS)
  • SICL: ECE ≈ 7.7% (Benign), reducing ECE by 10.4 percentage points from PseudoCal and 15.6 points in Dynamic. Overall, SICL achieves an average reduction in ECE of 13 percentage points relative to the next-best approach.

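Table: Excerpt from Benign stream, ResNet-50, CIFAR-10-C (ECE, %)

| Method     | TENT | EATA | SAR  | RoTTA | SoTTA (avg) |
|------------|------|------|------|-------|-------------|
| Vanilla TS | 22.7 | 22.3 | 22.2 | 22.9  | 22.8        |
| MC Dropout | 16.3 | 11.1 | 10.2 | 10.0  | 10.4        |
| PseudoCal  | 9.4  | 7.8  | 5.2  | 7.9   | 8.4         |
| SICL       | 6.4  | 6.9  | 7.5  | 7.1   | 7.5         |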

Comparable improvements hold under dynamic regime and across datasets and architectures.

6. Ablation and Qualitative Analysis

Several ablation studies and analyses isolate crucial aspects of SICL:

  • Style vs. Content Perturbation: Using content perturbations as ensemble variants increases ECE fourfold compared to style perturbations; content preservation is critical.
  • Candidate Quality: “ContentVariance” measured by Mahalanobis distance to class centroids is significantly lower for SICL’s style-perturbed variants than for MC Dropout, indicating that SICL preserves task-relevant geometry.
  • Style Variance: Style variance via Gram-matrix loss shows that SICL explores a broader style space than MixStyle or Dropout, yielding a richer ensemble.
  • Component Ablation: Removing the relaxation factor $\omega_{\rm relax}$ increases ECE by 1–2 points. Coarse $\mu$-perturbation is the dominant style perturbation; combining $\mu$ and $\sigma$ is optimal.
  • Sensitivity to $N$: ECE below 10% is reached with as few as $N = 5$ style variants; performance improves up to $N = 20$.

7. Significance and Implications

SICL provides a lightweight calibration layer for any test-time adaptation protocol. By operationalizing the style-invariance principle through style-perturbed prediction consistency and a collapse relaxation criterion, SICL offers a correctness likelihood that is reported to track true accuracy under both stationary and dynamically shifting test distributions. This addresses a key limitation of classical calibration under continual adaptation, with reported reductions in calibration error of approximately 13 percentage points compared to the strongest prior alternatives (Nam et al., 8 Dec 2025).
