Real-domain Noise Invariance Learning

Updated 4 July 2026

Real-domain Noise Invariance Learning is a class of methods that make predictions robust by ensuring invariance to real, signal-dependent noise via self-supervised feature and noise-space adaptation.
It employs techniques such as diffusion-based adaptation, dynamic noise library construction, channel shuffling, and residual swapping to mitigate shortcut learning between synthetic and real noise.
Empirical evidence across image restoration, segmentation, and speech tasks demonstrates significant performance gains over traditional synthetic noise models.

Real-domain Noise Invariance Learning denotes a class of learning strategies that make predictions, intermediate representations, or restored outputs insensitive to the statistics of real, target-domain noise rather than only to synthetic perturbations. In Ivan-ISTD, the term is defined explicitly as a self-supervised, feature-space regularization strategy that mines actual noise patches from unlabeled target-domain infrared images, constructs a dynamic noise library, and enforces invariance between clean and noise-corrupted feature representations (Li et al., 14 Oct 2025). In adjacent work on image restoration and speech recognition, closely related objectives appear as noise-space domain adaptation with diffusion models, translation of unknown real noise into Gaussian noise, adaptive normalization and transfer from synthetic to real noise, adversarial domain adaptation, and layerwise representation matching, all motivated by the same empirical fact: models trained on synthetic or seen perturbations often degrade when confronted with real degradations that are signal-dependent, spatially correlated, non-stationary, or otherwise out of distribution (Liao et al., 2024).

1. Problem formulation and scope

The central problem is a domain gap between source-domain training data and target-domain test data. In supervised image restoration, training typically uses synthetic pairs $(x^s, y^s)$, for example $y^s+\mathrm{AWGN}\rightarrow x^s$, whereas real degradations $x^r$ differ because of signal-dependent noise, compression artifacts, and uncontrolled lighting. The consequence is that models with low error on synthetic test sets often degrade severely on real noisy or blurry images (Liao et al., 2024). A closely related formulation writes the noisy observation as

$y = x + n_{\mathrm{real}}, \qquad n_{\mathrm{real}} \sim p_{\mathrm{real}}(n),$

with $p_{\mathrm{real}}(n)$ unknown. By contrast, the Gaussian noise used in synthetic training is spatially uncorrelated and independent of image content (Ha et al., 2024).
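The gap between the two noise models can be illustrated with a small NumPy sketch. It is not taken from any of the cited papers: the heteroscedastic model below (variance proportional to signal plus a constant read-noise floor) is only one common approximation of signal-dependent noise, and the function names and parameter values are assumptions chosen for illustration.

```python
import numpy as np

def add_awgn(clean, sigma, rng):
    """Additive white Gaussian noise: the same variance at every pixel."""
    return clean + rng.normal(0.0, sigma, size=clean.shape)

def add_signal_dependent(clean, gain, read_sigma, rng):
    """Assumed heteroscedastic model: variance = gain * signal + read_sigma**2."""
    std = np.sqrt(gain * np.clip(clean, 0.0, None) + read_sigma ** 2)
    return clean + rng.normal(0.0, 1.0, size=clean.shape) * std

rng = np.random.default_rng(0)
# Toy image: dark left half, bright right half.
clean = np.concatenate([np.full((64, 32), 0.1), np.full((64, 32), 0.9)], axis=1)

awgn_res = add_awgn(clean, 0.05, rng) - clean
dep_res = add_signal_dependent(clean, 0.01, 0.01, rng) - clean

# Ratio of noise std in the bright half to noise std in the dark half.
awgn_ratio = awgn_res[:, 32:].std() / awgn_res[:, :32].std()
dep_ratio = dep_res[:, 32:].std() / dep_res[:, :32].std()
print(f"AWGN bright/dark std ratio: {awgn_ratio:.2f}")
print(f"signal-dependent bright/dark std ratio: {dep_ratio:.2f}")
```

Under AWGN the ratio stays near one, while under the signal-dependent model the bright region is clearly noisier. A model trained only on the former never sees this content-correlated statistic.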

Within this setting, invariance means that perturbation-specific information should not dominate the learned representation or the restoration trajectory. In speech recognition, this principle is stated directly: a clean example and its superficially perturbed counterpart should not merely map to the same class; they should map to the same representation (Liang et al., 2018). In low-level vision, the corresponding objective is that synthetic and real inputs should be driven toward a shared clean-image manifold or a shared high-quality distribution, even when the observed degradations differ substantially (Liao et al., 2024).

This scope includes several technically distinct mechanisms. Some methods impose invariance in feature space through explicit penalties on paired clean/noisy representations. Some impose invariance in noise space by using a diffusion model whose denoising error is sensitive to the quality of auxiliary conditions. Others confine noise-specific adaptation to small affine parameters, separate content and noise codes, or explicitly translate unknown real noise into a controlled Gaussian prior. The common thread is that invariance is learned with respect to the nuisance aspects of real noise, not by assuming that handcrafted synthetic perturbations are sufficient.

2. Noise-space domain adaptation via diffusion models

“Denoising as Adaptation” formulates real-domain invariance as noise-space domain adaptation for image restoration (Liao et al., 2024). The core object is a denoising diffusion model $\epsilon_\theta$ that predicts the added noise $\epsilon$ in the DDPM forward process:

$\tilde y = \sqrt{\bar \alpha_t}\,y + \sqrt{1-\bar \alpha_t}\,\epsilon, \qquad \epsilon \sim N(0,I).$

The diffusion model receives not only the noisy input $\tilde y$ but also an auxiliary condition formed by concatenating two candidate clean estimates,

$C = \mathrm{concat}(\hat y^s,\hat y^r),$

where $\hat y^s$ and $\hat y^r$ are the synthetic-branch and real-branch outputs of the restoration network.

The paper’s central observation is that “better” conditions lead to lower diffusion prediction error $\|\epsilon - \epsilon_\theta(\tilde y, C, t)\|$. This yields a diffusion loss of the standard noise-prediction form,

$\mathcal{L}_{\mathrm{dif}} = \mathbb{E}_{t,\epsilon}\,\|\epsilon - \epsilon_\theta(\tilde y, C, t)\|_2^2,$

with $t$ the diffusion timestep and $\tilde y$ and $C$ as defined above.

Because gradients back-propagate through $C$ into both $\hat y^s$ and $\hat y^r$, the restoration network is forced to improve both synthetic and real outputs toward a shared target clean distribution.
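A minimal sketch of how the quality of the conditions affects the noise-prediction error, under strong simplifying assumptions. The `toy_eps_model` below is a hypothetical, untrained stand-in for the diffusion network: it treats the average of the two conditions as the clean image and solves the forward equation for the noise. It shows only the direction of the effect, not the paper's model.

```python
import numpy as np

def forward_diffuse(y, alpha_bar_t, eps):
    """DDPM forward step: sqrt(a_bar) * y + sqrt(1 - a_bar) * eps."""
    return np.sqrt(alpha_bar_t) * y + np.sqrt(1.0 - alpha_bar_t) * eps

def toy_eps_model(y_tilde, cond, alpha_bar_t):
    """Hypothetical stand-in for eps_theta (not a trained network).

    Uses the channel-mean of the condition as the clean-image estimate
    and inverts the forward equation to recover the noise.
    """
    y_hat = cond.mean(axis=0)
    return (y_tilde - np.sqrt(alpha_bar_t) * y_hat) / np.sqrt(1.0 - alpha_bar_t)

def diffusion_loss(y, y_hat_s, y_hat_r, alpha_bar_t, rng):
    """Mean squared noise-prediction error given two conditioning estimates."""
    eps = rng.standard_normal(y.shape)
    y_tilde = forward_diffuse(y, alpha_bar_t, eps)
    cond = np.stack([y_hat_s, y_hat_r], axis=0)  # C = concat(y_hat_s, y_hat_r)
    pred = toy_eps_model(y_tilde, cond, alpha_bar_t)
    return float(np.mean((eps - pred) ** 2))

rng = np.random.default_rng(0)
y = rng.random((16, 16))                                 # clean target
good = diffusion_loss(y, y + 0.01, y + 0.02, 0.5, rng)   # accurate estimates
bad = diffusion_loss(y, y + 0.30, y + 0.30, 0.5, rng)    # poor estimates
print(good, bad)
```

With this stand-in the loss reduces to a scaled squared error between the clean image and the averaged condition, so improving either estimate lowers the loss. That is the property through which the restoration network receives gradient.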

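A technical difficulty is shortcut learning. If the diffusion model can detect which channels come from the synthetic branch and which come from the real branch, or if it can simply match pixels between the synthetic-branch estimate $\hat y^s$ and the noised target $\tilde y$, then the training signal no longer enforces domain invariance. Two countermeasures are introduced. First, a channel-shuffling layer randomly permutes the channel order of $\hat y^s$ and $\hat y^r$ before concatenation, so that $\epsilon_\theta$ must rely on image content quality rather than channel index. Second, residual-swapping contrastive learning creates hard negatives by exchanging per-pixel residual maps:

$\hat y^{s}_{\mathrm{swap}} = x^s + r^r, \qquad \hat y^{r}_{\mathrm{swap}} = x^r + r^s,$

where $r^s = \hat y^s - x^s$ and $r^r = \hat y^r - x^r$. With positive and negative diffusion predictions $\epsilon_{\mathrm{pos}}$ and $\epsilon_{\mathrm{neg}}$, obtained from the original and the residual-swapped conditions respectively, the contrastive term takes a margin form,

$\mathcal{L}_{\mathrm{con}} = \max\big(\|\epsilon - \epsilon_{\mathrm{pos}}\| - \|\epsilon - \epsilon_{\mathrm{neg}}\| + \delta,\ 0\big),$

where $\delta$ is a margin.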
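The two countermeasures can be sketched in a few lines of NumPy. This is one plausible reading of the description above, not the authors' code: the permutation is drawn over all channels of the stacked pair, the residual is taken as estimate minus input, and all names are illustrative.

```python
import numpy as np

def shuffled_condition(y_hat_s, y_hat_r, rng):
    """Stack two (C, H, W) estimates and randomly permute the channel order."""
    stacked = np.concatenate([y_hat_s, y_hat_r], axis=0)
    perm = rng.permutation(stacked.shape[0])
    return stacked[perm], perm

def residual_swap(x_s, x_r, y_hat_s, y_hat_r):
    """Build hard negatives by exchanging per-pixel residuals between branches."""
    res_s = y_hat_s - x_s   # what the network changed on the synthetic input
    res_r = y_hat_r - x_r   # what the network changed on the real input
    return x_s + res_r, x_r + res_s

rng = np.random.default_rng(0)
x_s, x_r = rng.random((3, 8, 8)), rng.random((3, 8, 8))
y_hat_s, y_hat_r = x_s - 0.1, x_r - 0.2     # toy "restored" outputs

cond, perm = shuffled_condition(y_hat_s, y_hat_r, rng)
neg_s, neg_r = residual_swap(x_s, x_r, y_hat_s, y_hat_r)
print(cond.shape, perm)
```

Because the permutation changes at every call, a model consuming `cond` cannot rely on a fixed channel index to tell the branches apart. The swapped pair is a negative that still looks plausible but carries the wrong correction for each input.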

The total objective is

$\mathcal{L} = \mathcal{L}_{\mathrm{res}} + \lambda\,\mathcal{L}_{\mathrm{adapt}},$

where $\mathcal{L}_{\mathrm{res}}$ is a supervised pixel-wise Charbonnier loss (or an alternative pixel-wise norm) on synthetic pairs, $\mathcal{L}_{\mathrm{adapt}}$ collects the diffusion-based adaptation terms above, and the weight $\lambda$ is gradually increased during training using a sigmoid ramp whose hyper-parameters are specified in the paper.
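A sigmoid ramp for the adaptation weight could look like the following. The exact functional form and hyper-parameters used in the paper are not recoverable from the text here, so this sketch borrows the schedule popularised by domain-adversarial training (Ganin and Lempitsky); `lam_max` and `gamma` are assumed values.

```python
import math

def sigmoid_ramp(step, total_steps, lam_max=1.0, gamma=10.0):
    """Weight that rises smoothly from 0 toward lam_max over training.

    Assumed form: lam_max * (2 / (1 + exp(-gamma * p)) - 1), p = step / total_steps.
    """
    p = min(max(step / total_steps, 0.0), 1.0)
    return lam_max * (2.0 / (1.0 + math.exp(-gamma * p)) - 1.0)

weights = [sigmoid_ramp(s, 1000) for s in (0, 100, 500, 1000)]
print([round(w, 4) for w in weights])
```

Starting at zero lets the supervised restoration loss dominate while the outputs are still poor, and the adaptation terms gain influence only once the conditions carry useful information.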

The training procedure couples labeled synthetic data and unlabeled real inputs in the same mini-batch. Synthetic pairs are drawn from DIV2K+Flickr2K+… with AWGN at sampled noise levels, while unlabeled real inputs come from SIDD for denoising, RealBlur-J for deblurring, and SPA for deraining. Both restoration and diffusion networks use U-Net backbones; optimization uses Adam, a linear diffusion noise schedule, EMA on the diffusion weights, and fixed-size patches with random crop+rotation, with the learning rate, batch size, schedule endpoints, EMA decay, and patch size given in the paper. After training, the diffusion model $\epsilon_\theta$ is discarded and only the restoration network is used at inference.
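Two of the training ingredients named above, the linear noise schedule and the EMA of the diffusion weights, are standard and can be sketched generically. The endpoint values, number of steps, and decay below are common DDPM-style defaults used here as assumptions, not the paper's settings.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start, beta_end):
    """Linearly spaced per-step noise variances beta_1 .. beta_T."""
    return np.linspace(beta_start, beta_end, num_steps)

def alpha_bar(betas):
    """Cumulative product of (1 - beta_t), used in the closed-form forward step."""
    return np.cumprod(1.0 - betas)

def ema_update(ema_params, params, decay):
    """Exponential moving average of parameters, keyed by name."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k] for k in params}

betas = linear_beta_schedule(1000, 1e-4, 2e-2)   # assumed endpoints
a_bar = alpha_bar(betas)
ema = ema_update({"w": np.zeros(3)}, {"w": np.ones(3)}, decay=0.999)
print(a_bar[0], a_bar[-1], ema["w"])
```

The sequence `a_bar` decreases monotonically, so larger timesteps correspond to noisier versions of the target. The EMA copy changes slowly and would be the one kept for evaluating the diffusion model during training.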

This formulation is notable because invariance is not imposed by a separate domain classifier or by direct feature matching. Instead, the multi-step denoising process itself becomes the alignment mechanism: the restoration network is rewarded when both synthetic and real outputs serve as equally informative conditions for denoising toward the clean-image manifold.

3. Dynamic noise libraries and self-supervised feature alignment

In Ivan-ISTD, Real-domain Noise Invariance Learning is the second stage of a doubly wavelet-guided framework for cross-domain infrared small target detection (Li et al., 14 Oct 2025). Its definition is specific and operational: RNIL mines actual noise patches from unlabeled target-domain infrared images, constructs a “dynamic noise library,” and uses that library to teach an encoder–decoder network to ignore target-domain perturbations.

Noise extraction proceeds by randomly selecting $N$ target images, sliding a window of size $k \times k$ over each image, and computing the local mean $\mu_p$ and variance $\sigma_p^2$ for each patch $p$. A patch is selected if its local statistics satisfy the paper's threshold conditions on the variance and/or the mean. The collected patches $\{n_i\}$ are then upsampled to full image size by bilinear interpolation and assembled into

$\mathcal{N} = \{\tilde n_1, \tilde n_2, \dots, \tilde n_M\}.$

The library is “dynamic” because it reflects the non-stationary, heteroscedastic disturbances in the target domain rather than a fixed synthetic perturbation model.
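The mining procedure can be sketched as follows. The selection rule here (variance at least `var_min` and mean at most `mean_max`) is a hypothetical stand-in, since the paper's actual thresholds are not reproduced in this text; the window size, stride, and the align-corners style of bilinear resize are likewise assumptions.

```python
import numpy as np

def bilinear_resize(patch, out_h, out_w):
    """Separable linear interpolation of a 2-D patch to (out_h, out_w)."""
    h, w = patch.shape
    rows = np.linspace(0, h - 1, out_h)
    cols = np.linspace(0, w - 1, out_w)
    # Interpolate along the width for each row, then along the height.
    tmp = np.stack([np.interp(cols, np.arange(w), patch[i]) for i in range(h)])
    return np.stack(
        [np.interp(rows, np.arange(h), tmp[:, j]) for j in range(out_w)], axis=1
    )

def mine_noise_library(images, n_images, window, stride, var_min, mean_max, rng):
    """Collect full-size noise maps from patches that pass the threshold test."""
    chosen = rng.choice(len(images), size=min(n_images, len(images)), replace=False)
    library = []
    for idx in chosen:
        img = images[idx]
        height, width = img.shape
        for top in range(0, height - window + 1, stride):
            for left in range(0, width - window + 1, stride):
                patch = img[top:top + window, left:left + window]
                # Hypothetical rule: enough variation, not too bright.
                if patch.var() >= var_min and patch.mean() <= mean_max:
                    library.append(bilinear_resize(patch, height, width))
    return library

rng = np.random.default_rng(0)
flat = np.zeros((32, 32))                          # no noise: should be skipped
noisy = rng.normal(0.2, 0.05, size=(32, 32))       # toy target-domain image
library = mine_noise_library([flat, noisy], n_images=2, window=8, stride=8,
                             var_min=1e-4, mean_max=1.0, rng=rng)
print(len(library), library[0].shape)
```

In this toy run only patches from the noisy image enter the library, and each entry already has the full image size, so it can be mixed directly with a source image.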

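During training, for each source image $x$, the method samples a noise map $\tilde n$ uniformly from the library and a mixing coefficient $\alpha$, then forms the convex combination

$x' = (1-\alpha)\,x + \alpha\,\tilde n.$

A dual-branch encoder–decoder $f_\theta$ with shared weights processes both the clean branch and the noisy branch:

$(\hat m, F) = f_\theta(x), \qquad (\hat m', F') = f_\theta(x'),$

where $\hat m$ and $\hat m'$ are the predicted masks and $F$ and $F'$ the decoded feature maps.

Supervision combines a multi-scale binary cross-entropy segmentation loss,

$\mathcal{L}_{\mathrm{seg}} = \sum_{k} \mathrm{BCE}(\hat m_k, m_k),$

with a self-supervised alignment loss on the final decoded feature map, or an intermediate alignment node,

$\mathcal{L}_{\mathrm{self}} = \mathrm{MSE}(F, F').$

The total loss is simply

$\mathcal{L} = \mathcal{L}_{\mathrm{seg}} + \mathcal{L}_{\mathrm{self}}.$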

No extra regularization term is introduced.
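A compact sketch of the training signal described above: pixel-space mixing with a library noise map, a shared-weight network applied to both branches, a BCE segmentation term, and an MSE feature-alignment term. The `toy_network` is a hypothetical one-parameter stand-in for the encoder–decoder, the loss is single-scale for brevity, and supervising both branches with the same mask is an assumption.

```python
import numpy as np

def mix_with_noise(x, noise_map, alpha):
    """Convex mix in pixel space; alpha is taken as the noise weight (assumed)."""
    return (1.0 - alpha) * x + alpha * noise_map

def bce(prob, target, eps=1e-7):
    """Binary cross-entropy averaged over pixels."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob)))

def toy_network(x, w):
    """Hypothetical shared-weight model: feature = w * x, mask = sigmoid(feature)."""
    feat = w * x
    return 1.0 / (1.0 + np.exp(-feat)), feat

def rnil_loss(x, mask, library, w, alpha, rng):
    """Segmentation loss on both branches plus MSE alignment of their features."""
    noise_map = library[rng.integers(len(library))]   # uniform draw from library
    x_noisy = mix_with_noise(x, noise_map, alpha)
    pred_c, feat_c = toy_network(x, w)
    pred_n, feat_n = toy_network(x_noisy, w)
    seg = bce(pred_c, mask) + bce(pred_n, mask)
    align = float(np.mean((feat_c - feat_n) ** 2))
    return seg + align, seg, align

rng = np.random.default_rng(0)
x = rng.random((16, 16))
mask = (x > 0.8).astype(float)
library = [rng.normal(0.0, 0.1, (16, 16)) for _ in range(4)]
total0, seg0, align0 = rnil_loss(x, mask, library, w=2.0, alpha=0.0, rng=rng)
total1, seg1, align1 = rnil_loss(x, mask, library, w=2.0, alpha=0.3, rng=rng)
print(align0, align1)
```

With `alpha=0` the two branches coincide and the alignment term vanishes. Any positive mixing weight produces a feature gap that the shared weights must learn to close.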

Architecturally, RNIL uses a five-layer residual downsampling stack and four upsampling+fusion blocks. The output of the final decoder block serves as the representation for the alignment loss. Both branches share encoder–decoder weights, so the alignment objective is explicitly self-supervision rather than distillation. The paper also emphasizes a negative clarification: although wavelet filtering is heavily used in Stage I for background extraction, RNIL itself operates purely on the reconstructed images and not in the wavelet domain.

This formulation makes the invariance target concrete. Rather than matching source and target domains at the level of global feature distributions, RNIL records target-domain perturbations, injects them into source samples by convex mixing in pixel space, and then forces the decoded features of clean and perturbed inputs to remain close. The intended effect is to overcome the limitations of distribution bias in traditional artificial noise modeling.

4. Alternative mechanisms for enforcing invariance to real noise

Several adjacent methods pursue the same objective through different inductive biases and loss constructions. A noise translation framework first pretrains a denoiser on pure Gaussian noise, then learns a translation network that maps a real noisy image into an image whose noise is approximately Gaussian. The translated image is restored by the frozen Gaussian denoiser, and the translation network is optimized by an implicit reconstruction term together with explicit Wasserstein losses on the spatial and frequency distributions of the translated residual noise. The framework contains no adversarial term; the explicit Wasserstein-based losses perform the role of distribution matching (Ha et al., 2024). This places invariance in a translated-noise space rather than in a shared feature space.
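To make the distribution-matching idea concrete, the sketch below computes the one-dimensional Wasserstein-1 distance between two equal-size samples, which equals the mean absolute gap between their sorted values. It is a generic illustration of the distance, not the loss implemented by Ha et al.; the sample sizes and the skewed comparison distribution are arbitrary choices.

```python
import numpy as np

def wasserstein1_1d(a, b):
    """W1 between two equal-size 1-D samples: mean gap between sorted values."""
    a, b = np.sort(np.ravel(a)), np.sort(np.ravel(b))
    if a.size != b.size:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(a - b)))

rng = np.random.default_rng(0)
gaussian_ref = rng.normal(0.0, 1.0, 4096)      # target Gaussian prior
gaussian_like = rng.normal(0.0, 1.0, 4096)     # residual that already matches
skewed = rng.exponential(1.0, 4096) - 1.0      # zero-mean but non-Gaussian residual

d_close = wasserstein1_1d(gaussian_like, gaussian_ref)
d_far = wasserstein1_1d(skewed, gaussian_ref)
print(d_close, d_far)
```

A residual that is already Gaussian gives a small distance, while a skewed residual with the same mean does not. This is the kind of signal a translation network can minimise without a discriminator.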

Adaptive Instance Normalization provides a different factorization. In AINDNet, each AIN-ResBlock normalizes a feature map channel-wise and then rescales and shifts it using affine parameters derived from an estimated pixel-wise noise-level map. During transfer from synthetic noise to real noise, all “content” parameters in the reconstruction network are frozen, while only the noise-level estimator, the AdaIN affine generators, and the final reconstruction convolution are updated. The paper interprets this as learning a noise-invariant feature backbone whose noise-specific adaptation is confined to small affine parameters (Kim et al., 2020).
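The normalize-then-modulate step can be sketched as below. In AINDNet the affine parameters come from learned convolutions over the noise-level map; here a per-channel linear map of the noise level stands in for those generators, which is an assumption made to keep the example self-contained.

```python
import numpy as np

def adaptive_instance_norm(feat, noise_map, w_gamma, b_gamma, w_beta, b_beta, eps=1e-5):
    """Channel-wise instance norm, then a pixel-wise affine from the noise map.

    feat: (C, H, W) features; noise_map: (H, W) estimated noise level.
    The linear generators (w_*, b_*: shape (C,)) are hypothetical stand-ins.
    """
    mean = feat.mean(axis=(1, 2), keepdims=True)
    std = feat.std(axis=(1, 2), keepdims=True)
    normed = (feat - mean) / (std + eps)
    gamma = w_gamma[:, None, None] * noise_map[None] + b_gamma[:, None, None]
    beta = w_beta[:, None, None] * noise_map[None] + b_beta[:, None, None]
    return gamma * normed + beta

rng = np.random.default_rng(0)
feat = rng.normal(3.0, 2.0, size=(4, 8, 8))
noise_map = np.full((8, 8), 0.1)
zeros, ones = np.zeros(4), np.ones(4)
# Identity modulation: gamma = 1, beta = 0 everywhere.
out = adaptive_instance_norm(feat, noise_map, zeros, ones, zeros, zeros)
print(out.mean(axis=(1, 2)), out.std(axis=(1, 2)))
```

Everything that depends on the noise level sits in the generator parameters, so freezing the rest of the network while updating only those parameters is a small, well-localised adaptation.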

Unsupervised image restoration has also approached invariance through discrete disentangled representation and adversarial domain adaptation. In that setting, noisy inputs are encoded into a content code and a noise code, clean samples are encoded into content codes, and a latent discriminator aligns content codes from noisy and clean domains. A Background Consistency Module and a Semantic Consistency Module provide extra self-supervised constraints so that the learned representation remains robust under dual domain constraints in both feature and image domains (Du et al., 2020). Here invariance is achieved by disentangling and adversarial alignment rather than by target-domain noise mining.

Speech recognition supplies two further formulations. Invariant-Representation-Learning matches clean and noisy representations with a joint $\ell_2$ and cosine penalty on the layer-$l$ representations $h_l(x)$ and $h_l(x')$ of a clean input $x$ and its noisy counterpart $x'$, applied at one layer or cumulatively across layers,

$\mathcal{L}_{\mathrm{inv}} = \sum_{l}\Big[\lambda_{d}\,\|h_l(x) - h_l(x')\|_2 + \lambda_{c}\,\big(1 - \cos(h_l(x), h_l(x'))\big)\Big],$

and reports that cumulative matching prevents noise-specific features from “creeping in” later in the network (Liang et al., 2018). A more adversarial alternative inserts a gradient-reversal layer between a shared encoder and a noise-condition classifier so that the encoder maximizes domain-classification loss while minimizing ASR loss, thereby producing compact noise-invariant embeddings (Serdyuk et al., 2016).
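The representation-matching penalty can be sketched as a distance term plus a cosine term, summed over whichever layers are matched. The weights `w_l2` and `w_cos` and the exact combination are illustrative assumptions rather than the settings of Liang et al.

```python
import numpy as np

def layer_penalty(h_clean, h_noisy, w_l2=1.0, w_cos=1.0, eps=1e-8):
    """L2 distance plus cosine distance between two representation vectors."""
    dist = np.linalg.norm(h_clean - h_noisy)
    cos = np.dot(h_clean, h_noisy) / (
        np.linalg.norm(h_clean) * np.linalg.norm(h_noisy) + eps
    )
    return float(w_l2 * dist + w_cos * (1.0 - cos))

def cumulative_penalty(clean_layers, noisy_layers, **weights):
    """Sum the per-layer penalty over all matched layers."""
    return sum(layer_penalty(c, n, **weights) for c, n in zip(clean_layers, noisy_layers))

rng = np.random.default_rng(0)
clean_layers = [rng.normal(size=32) for _ in range(3)]
noisy_layers = [h + rng.normal(0.0, 0.2, size=32) for h in clean_layers]

same = cumulative_penalty(clean_layers, clean_layers)
different = cumulative_penalty(clean_layers, noisy_layers)
single = layer_penalty(clean_layers[-1], noisy_layers[-1])
print(same, different, single)
```

Matching only the last layer uses `layer_penalty` once, while the cumulative variant constrains every intermediate representation and so leaves less room for noise-specific features to reappear deeper in the network.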

A related but more abstract theoretical line appears in transform learning. “Pivotal Auto-Encoder via Self-Normalizing ReLU” replaces the quadratic sparse-coding objective by a square-root-lasso-type objective,

$\min_{z}\ \|y - Dz\|_2 + \lambda\,\|z\|_1,$

which decouples the regularization weight $\lambda$ from the unknown noise level and yields an unrolled Self-Normalizing ReLU architecture called NeLU (Goldenstein et al., 2024). At the level of distribution comparison rather than restoration, Phase Discrepancy uses the phase of characteristic functions so that additive symmetric positive-definite noise cancels, giving invariant features of distributions robust to measurement noise (Law et al., 2017). These works are not all “real-domain” in the Ivan-ISTD sense, but they clarify that invariance can be imposed at the levels of translated noise, normalized feature statistics, latent codes, sparse transforms, or distributional embeddings.
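One way to see why a square-root-lasso objective is insensitive to the noise scale is that the unsquared residual norm makes the whole objective positively homogeneous of degree one: scaling the data and the code by the same factor scales the objective by that factor, so the same regularization weight remains appropriate. The quadratic lasso lacks this property. The check below illustrates only that scaling argument on random data; it does not implement the NeLU architecture.

```python
import numpy as np

def sqrt_lasso_obj(y, D, z, lam):
    """Square-root lasso: unsquared residual norm plus l1 penalty."""
    return float(np.linalg.norm(y - D @ z) + lam * np.abs(z).sum())

def lasso_obj(y, D, z, lam):
    """Standard lasso: half squared residual norm plus l1 penalty."""
    return float(0.5 * np.linalg.norm(y - D @ z) ** 2 + lam * np.abs(z).sum())

rng = np.random.default_rng(0)
D = rng.normal(size=(20, 10))
z = rng.normal(size=10)
y = D @ z + 0.1 * rng.normal(size=20)
c, lam = 5.0, 0.3   # c rescales signal and noise together

sqrt_scaled = sqrt_lasso_obj(c * y, D, c * z, lam)
sqrt_base = sqrt_lasso_obj(y, D, z, lam)
lasso_scaled = lasso_obj(c * y, D, c * z, lam)
lasso_base = lasso_obj(y, D, z, lam)
print(sqrt_scaled / sqrt_base, lasso_scaled / lasso_base)
```

The first ratio equals `c` exactly up to rounding, while the second does not, which is why the quadratic objective needs its weight retuned whenever the noise level changes.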

5. Empirical evidence and ablation findings

Reported quantitative results span image restoration, infrared small-target detection, speech recognition, transfer learning from synthetic to real noise, and noise-level-invariant auto-encoding (Liao et al., 2024, Li et al., 14 Oct 2025, Liang et al., 2018, Kim et al., 2020, Goldenstein et al., 2024).

Setting	Method/variant	Reported result
SIDD denoising	Vanilla / DANN / noise-DA	26.58 dB, 30.09 dB, 34.71 dB
SPA deraining (Y-channel)	Vanilla / Restormer / noise-DA	33.04 dB, 34.17 dB, 34.39 dB
RealBlur-J deblurring	Vanilla / DANN / noise-DA	26.27 dB, 26.11 dB, 26.46 dB
ISTD F1 ablation	Baseline / +Fusion only / +Fusion + Self-sup	84.47%, 84.98%, 85.51%
LibriSpeech test-clean / test-other	Baseline / IRL-C	6.5% / 18.1%, 3.3% / 11.0%
BSD68 at the highest test noise level (PSNR, dB)	ReLU / NeLU	12.99, 14.57

For “Denoising as Adaptation,” the denoising task on the SIDD test set reports PSNR 26.58 dB for the Vanilla synthetic-only model, 30.09 dB for DANN, and 34.71 dB for the proposed noise-space domain adaptation, with SSIM and LPIPS reported alongside. On SPA deraining, the reported PSNR values are 33.04 dB for Vanilla, 34.17 dB for Restormer trained on synthetic data, and 34.39 dB for the proposed method. On RealBlur-J deblurring, the corresponding values are 26.27 dB for Vanilla, 26.11 dB for DANN, and 26.46 dB for the proposed method. The shortcut-preventing ablation on SIDD is especially diagnostic: the full configuration, which combines the chosen diffusion noise sampling range with channel shuffling and residual swapping, gives the best PSNR and SSIM; removing residual swapping lowers both, and removing both channel shuffling and residual swapping lowers them further. The paper further notes that too low a noise range allows shortcuts early, whereas too high a range leads to unstable local minima.
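For readers comparing the dB figures, PSNR follows directly from the mean squared error between a reference and an estimate. The helper below is the standard definition for images scaled to a known peak value; the peak of 1.0 is an assumption about the data range.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max^2 / MSE)."""
    ref = np.asarray(reference, dtype=float)
    est = np.asarray(estimate, dtype=float)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

# A uniform error of 0.1 on a [0, 1] image gives MSE = 0.01.
value = psnr(np.zeros((4, 4)), np.full((4, 4), 0.1))
print(round(value, 2))
```

Because the scale is logarithmic, a gain of about 3 dB corresponds to roughly halving the mean squared error.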

For RNIL in Ivan-ISTD, the ablation on noise types compares Composite Noise with the Real-World Noise Library and reports an advantage for the latter, with similar gains in PixAcc, mIoU and Pd. The self-supervision and fusion ablation reports F1 of 84.47% for the baseline with no fusion and no self-supervised loss, 84.98% for dynamic noise fusion without self-supervision, and 85.51% for the full version with fusion and self-supervision. The paper also identifies a best mixing weight and reports PixAcc, mIoU, nIoU, and Pd at that setting.

Invariant-Representation-Learning in speech recognition reports character error rates on LibriSpeech of 6.5% and 18.1% for the baseline on test-clean and test-other, compared with 3.3% and 11.0% for IRL-C. On out-of-domain perturbations, the improvements are larger: results are reported for Baseline, DataAug, and IRL-C under additive noise, WSJ overlap, impulse response, and telephony conditions.
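Character error rate is the character-level edit distance between the reference transcript and the hypothesis, divided by the reference length. A small pure-Python sketch of that standard definition (not the evaluation script of the cited paper):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def character_error_rate(ref, hyp):
    """Edit distance normalised by the number of reference characters."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

cer = character_error_rate("kitten", "sitting")
print(cer)
```

Since insertions count as errors, the rate can exceed 100% for very long hypotheses, which is worth keeping in mind when reading results under heavy noise.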

AINDNet reports PSNR and SSIM on DND for the purely synthetic-trained model AINDNet(S), together with a separate PSNR figure after self-ensemble. Under the “transfer-only-AdaIN” scheme with the full SIDD real-noise set, AINDNet+TF reports PSNR and SSIM on SIDD and PSNR on DND. Results are also reported on SIDD for AINDNet+TF when fine-tuning is restricted to a small number of real pairs. The ablation also states that replacing AdaIN with ordinary InstanceNorm + concatenation lowers PSNR on DND, and that freezing AdaIN or the noise estimator during fine-tuning cripples performance.

The NeLU auto-encoder results are narrower in scope but relevant as evidence that invariance can be targeted at unknown test-time noise levels. On BSD68, models trained at a fixed noise level are evaluated at a range of test noise levels. At the highest level, the conv+ReLU baseline reports PSNR 12.99, whereas NeLU reports 14.57. The paper states that the classic soft-threshold encoder degrades rapidly for unseen noise levels, while NeLU remains stable.

6. Conceptual distinctions, common misconceptions, and open issues

A first misconception is that real-domain noise invariance is equivalent to ordinary data augmentation. The cited literature repeatedly separates the two. In speech, IRL augments data but additionally coerces matched representations at each layer, and the reported gains exceed those of data augmentation alone, especially in out-of-domain noise settings (Liang et al., 2018). In infrared detection, the Real-World Noise Library outperforms Composite Noise, indicating that mining actual target-domain perturbations is not reducible to hand-crafted augmentation (Li et al., 14 Oct 2025). In image denoising, simply adding Gaussian noise to the real noisy image helps some datasets but is described as suboptimal and non-adaptive, whereas explicit translation to Gaussian statistics performs the distribution matching (Ha et al., 2024).

A second misconception is that invariance must be adversarial. Some methods do use a domain discriminator or a gradient-reversal layer, as in adversarial ASR and unsupervised image restoration (Serdyuk et al., 2016, Du et al., 2020). Others explicitly avoid this route. The noise translation framework contains no adversarial term, relying instead on Wasserstein-based spatial and frequency losses (Ha et al., 2024). RNIL in Ivan-ISTD uses only an MSE self-supervised consistency term in addition to the segmentation loss (Li et al., 14 Oct 2025). “Denoising as Adaptation” achieves alignment through diffusion prediction error and a contrastive hard-negative construction rather than through a conventional discriminator (Liao et al., 2024).

A third distinction concerns where invariance is imposed. The literature shows several loci: restored outputs conditioned through a diffusion model, feature maps aligned by MSE or $\ell_2$ plus cosine penalties, normalized features modulated by AdaIN parameters, disentangled content/noise codes, translated noisy images designed to obey Gaussian and Rayleigh statistics, and sparse codes obtained from a pivotal objective (Liao et al., 2024, Kim et al., 2020, Goldenstein et al., 2024). A plausible implication is that successful methods tend to isolate nuisance variability into explicitly controlled variables (noise maps, affine style parameters, noise codes, translated residuals, or adversarially suppressed features) while keeping content pathways stable.

Open issues remain. Diffusion-based noise-space adaptation is sensitive to shortcut behavior: too low a diffusion noise range permits early shortcuts, whereas too high a range leads to unstable local minima (Liao et al., 2024). Square-root-lasso-based invariance currently comes with limitations that the paper states explicitly: finite-depth unrolling still relies on backpropagation, computational cost grows with the number of unrolled iterations, and the theoretical bounds involve unknown constants that may be loose in practice (Goldenstein et al., 2024). More broadly, the persistence of signal-dependent, spatially varying, and heteroscedastic real noise suggests that invariant learning remains tied to the quality of the factorization one chooses: feature, noise, latent, or distributional. The cited work collectively indicates that real-domain invariance is not a single algorithmic recipe but a design principle for constraining how models represent perturbations that are irrelevant to the downstream task.