Adverse Weather Distillation

Updated 4 July 2026

Adverse weather distillation is a technique that transfers clear-weather model knowledge to systems operating under rain, fog, or night conditions.
It employs diverse distillation targets including disparity maps, restoration residuals, and cost-volume statistics to overcome domain shift and weak supervision.
The method’s efficacy is demonstrated across tasks such as image restoration, monocular depth estimation, optical flow, and LiDAR object detection with measurable performance gains.

Adverse weather distillation denotes a family of distillation procedures used to transfer clean-scene, clear-weather, or teacher-model knowledge into models that must operate under haze, rain, snow, fog, night, or related degradations. In the recent literature, the term covers both conventional knowledge distillation—such as teacher–student alignment of outputs, features, cost volumes, or detector responses—and data distillation procedures that construct supervisory pairs from degraded imagery itself. The underlying motivation is consistent across tasks: adverse weather weakens direct supervision, violates photometric assumptions, or induces severe domain shift, so training is reformulated around invariances between clean and degraded observations or around pseudo-labels generated in a more reliable domain (Wang et al., 23 Sep 2025, Jiang et al., 18 May 2025, Zhou et al., 2024, Huang et al., 2024, Tan et al., 2023, Lin et al., 2019, Cheng et al., 2024).

1. Terminological scope and task coverage

Within the literature, adverse weather distillation is not restricted to a single task or a single form of supervision. In self-supervised stereo matching, RoSe names a second training stage “adverse-weather distillation,” in which a teacher trained with scene-correspondence priors on clear/adverse pairs produces pseudo-disparities on clear stereo inputs, and a fresh student is trained on mixed clear or degraded inputs to match those predictions (Wang et al., 23 Sep 2025). In monocular depth estimation, ACDepth uses a multi-granularity knowledge distillation strategy in which a student absorbs knowledge from a clear-trained teacher model and pretrained Depth Anything V2, with feature-wise distillation, ordinal guidance, and feature consistency across degradation types (Jiang et al., 18 May 2025). In adverse weather removal, distillation appears both as continual knowledge replay on a unified network structure and as soft residual transfer from CLIP features into a restoration backbone (Cheng et al., 2024, Tan et al., 2023).

The same logic extends beyond restoration and depth. In adverse weather optical flow, synthetic-domain motion statistics are distilled into a real-domain network by aligning cost-volume correlation histograms and by using pseudo-labels from a synthetic-degraded encoder (Zhou et al., 2024). In LiDAR-based 3D object detection, Sunny-to-Rainy Knowledge Distillation aligns RoI instance features and final detector responses between a sunny teacher and rainy student while adding a noise-aware correction term (Huang et al., 2024). Earlier real-image deraining work uses “data distillation” rather than teacher–student logits: a rainy image is paired first with a coarsely derained soft label and then with a clean image onto which the extracted rain residual is re-applied, yielding a hard rainy–clean pair for shared-network training (Lin et al., 2019).

Setting	Source of distilled knowledge	Distillation target
Continual all-in-one weather removal	Frozen old model and replay buffer	Predictions and principal features
Stereo matching	Clear-pair teacher pseudo-disparity	Student disparity on mixed inputs
Monocular depth estimation	Clear-trained teacher and Depth Anything V2	Multi-scale features, ordinal relations, feature consistency
Optical flow	Synthetic-degraded encoder	Cost-volume histogram and pseudo-flow
3D object detection	Sunny detector	RoI features and detector responses
Real-image deraining	Filtered rainy image and extracted rain residual	Soft and hard supervisory pairs

A common misconception is that adverse weather distillation is merely “logit matching under bad weather.” The surveyed methods contradict that interpretation. Their distilled objects include per-pixel restorations, compressed mid-level embeddings, residual spatial features, disparity maps, cost-volume statistics, RoI features, and even re-synthesized rainy images (Cheng et al., 2024, Wang et al., 23 Sep 2025, Tan et al., 2023, Zhou et al., 2024, Huang et al., 2024, Lin et al., 2019).

2. Distillation targets and objective functions

A central axis of variation is the object being matched. In continual all-in-one adverse weather removal, the old model at stage $t-1$ and the current model at stage $t$ are both evaluated on a stored degraded sample $\bar I$, yielding

$y^- = \phi^{(t-1)}(F^{(t-1)}(\bar I)), \qquad y^+ = \phi^t(F^t(\bar I)).$

Prediction-level distillation combines an $\ell_1$ term with a contrastive reconstruction regularizer,

$L_{KD}(\bar I)=\|y^- - y^+\|_1 + \beta_2 \cdot L_{CT}(y^+,y^-,\bar I),$

while principal-feature distillation aligns compressed features,

$L_{PKD}(\bar I)=\|\psi(F^{(t-1)}(\bar I))-\psi(F^t(\bar I))\|_1.$

The total continual objective adds these replay losses to the standard single-weather restoration loss on current-task samples (Cheng et al., 2024).
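The two replay losses above can be sketched in a few lines. This is a minimal illustration on flattened lists, not the authors' implementation: the form of the contrastive term `L_CT`, the value of `beta2`, the use of a mean instead of a sum inside the norm, and all function names are assumptions made for the example.

```python
def l1(a, b):
    """Mean absolute difference between two equally sized flat vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def prediction_kd(y_old, y_new, degraded, beta2=0.1, eps=1e-8):
    """L_KD = ||y_old - y_new||_1 + beta2 * L_CT(y_new, y_old, degraded).

    ASSUMPTION: L_CT is modelled as a ratio that shrinks when the new output
    is close to the old output and far from the degraded input. The source
    text does not give its exact form.
    """
    contrast = l1(y_new, y_old) / (l1(y_new, degraded) + eps)
    return l1(y_old, y_new) + beta2 * contrast


def principal_feature_kd(feat_old, feat_new, psi):
    """L_PKD = ||psi(F_old) - psi(F_new)||_1 with a frozen projector psi."""
    return l1(psi(feat_old), psi(feat_new))


# Hypothetical replayed sample: old and new outputs on one degraded input.
degraded = [0.9, 0.8, 0.7, 0.6]
y_old = [0.5, 0.4, 0.3, 0.2]
y_new = [0.5, 0.4, 0.3, 0.3]
keep_even = lambda f: f[::2]  # stand-in for the learned projector psi
replay_loss = prediction_kd(y_old, y_new, degraded) + principal_feature_kd(y_old, y_new, keep_even)
```

In a full training loop this replay loss would be added to the restoration loss on current-task samples, as the text describes.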

RoSe’s stereo formulation is structurally simpler at the final stage. After Step 1 self-supervised scene-correspondence learning, the model is frozen as a teacher $f$. For a clear pair $c_i$, the teacher produces a masked pseudo-disparity $D_i^*=f(c_i)$. A re-initialized student is then trained on mixed clear and degraded inputs to match $D_i^*$, with the loss restricted by a mask obtained from a left–right consistency check. The key property is that the student is supervised by the teacher’s clear-scene prediction even when the input is foggy, rainy, or nocturnal (Wang et al., 23 Sep 2025).
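A masked pseudo-disparity loss of this kind can be sketched on a single scanline. The sketch assumes an absolute-difference distance, a tolerance `tau` of one pixel, and the function names shown; none of these are taken from the RoSe paper.

```python
def left_right_mask(disp_left, disp_right, tau=1.0):
    """Mark left-view pixels whose disparity agrees with the right view.

    A left pixel x with disparity d maps to right pixel x - d. It is kept
    when the right-view disparity found there differs from d by at most tau.
    """
    mask = []
    for x, d in enumerate(disp_left):
        xr = int(round(x - d))
        ok = 0 <= xr < len(disp_right) and abs(d - disp_right[xr]) <= tau
        mask.append(1.0 if ok else 0.0)
    return mask


def masked_distill_loss(student_disp, pseudo_disp, mask):
    """Mean |student - teacher| over the pixels the mask keeps (0.0 if none)."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    total = sum(m * abs(s - p) for s, p, m in zip(student_disp, pseudo_disp, mask))
    return total / kept


# Teacher disparities come from the clear pair; the student sees the
# degraded version of the same scene (values here are made up).
teacher_left = [2.0, 2.0, 2.0, 2.0]
teacher_right = [2.0, 9.0, 2.0, 2.0]   # pixel 1 is inconsistent
student_left = [5.0, 5.0, 3.0, 5.0]
mask = left_right_mask(teacher_left, teacher_right)
loss = masked_distill_loss(student_left, teacher_left, mask)
```

Pixels that map outside the image or fail the check contribute nothing, so unreliable teacher predictions do not supervise the student.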

ACDepth introduces a broader distillation stack. Its overall objective is a weighted sum of loss terms, three of which perform distillation. The feature-wise term aligns multi-scale teacher and student features on mixed clear or degraded samples. The consistency term explicitly ties student degraded features to stop-gradient teacher-clear and student-clear features. The ordinal guidance distillation term focuses the model on uncertain regions, defined by thresholding the normalized inverse-depth disagreement between teacher and student, and compares depth orderings up to an ordinal tolerance (Jiang et al., 18 May 2025).
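The idea of selecting uncertain regions and comparing depth orderings can be illustrated as follows. This is only a sketch: the min-max normalization, the threshold and tolerance values, and the ratio-based ordinal rule (a common convention in relative-depth work) are assumptions, not ACDepth's published definitions.

```python
def normalise_inverse_depth(depth):
    """Min-max normalise 1/depth to [0, 1] (ASSUMED normalisation)."""
    inv = [1.0 / d for d in depth]
    lo, hi = min(inv), max(inv)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in inv]


def uncertain_mask(teacher_depth, student_depth, threshold=0.1):
    """True where normalised inverse depths disagree by more than threshold."""
    t = normalise_inverse_depth(teacher_depth)
    s = normalise_inverse_depth(student_depth)
    return [abs(a - b) > threshold for a, b in zip(t, s)]


def ordinal_label(depth_a, depth_b, tolerance=0.03):
    """+1 if a is farther than b, -1 if closer, 0 if equal within tolerance."""
    ratio = depth_a / depth_b
    if ratio > 1.0 + tolerance:
        return 1
    if ratio < 1.0 / (1.0 + tolerance):
        return -1
    return 0


# Hypothetical depths (metres) at three pixels.
teacher = [1.0, 2.0, 4.0]
student = [1.0, 4.0, 4.0]
focus = uncertain_mask(teacher, student)
```

An ordinal loss would then be evaluated only on point pairs drawn from the pixels flagged in `focus`, using the teacher's `ordinal_label` as the target.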

Other domains use different alignment spaces. In CLIP-based adverse weather removal, the teacher signal is a residual feature computed from CLIP image-encoder features on clean and weathered images; the SAR encoder’s residuals are matched to this quantity after channel-wise normalization (Tan et al., 2023). In CH$^2$DA-Flow, the distillation target is not the output flow alone but the distribution of sampled cost-volume correlations: synthetic and real degraded histograms are aligned by a dedicated loss, and pseudo-flow supervision is added through a separate term (Zhou et al., 2024). In SRKD for 3D object detection, instance-feature matching, response distillation, and noise-aware prediction correction are combined into one weighted objective (Huang et al., 2024).
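Aligning distributions of cost-volume correlations, as described for the optical flow case, can be sketched with normalized histograms and a divergence between them. The KL divergence, the bin count, the value range, and the function names are assumptions chosen for illustration; the source text does not specify the exact alignment loss.

```python
import math


def correlation_histogram(correlations, bins=8, lo=-1.0, hi=1.0):
    """Normalised histogram of sampled cost-volume correlation values."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for c in correlations:
        idx = min(bins - 1, max(0, int((c - lo) / width)))
        counts[idx] += 1
    total = len(correlations)
    return [n / total for n in counts]


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) with additive smoothing (ASSUMED stand-in for the loss)."""
    return sum((pi + eps) * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Hypothetical correlation samples from the synthetic and real branches.
synthetic = correlation_histogram([-0.9, -0.9, 0.9, 0.9], bins=2)
real = correlation_histogram([0.9, 0.9, 0.9, -0.9], bins=2)
alignment_loss = kl_divergence(synthetic, real)
```

Matching histograms rather than per-pixel flow lets the real-domain network inherit motion statistics without requiring real ground-truth flow.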

These formulations show that adverse weather distillation is often closer to structured correspondence transfer than to classical classification KD. The continual all-in-one weather removal framework explicitly distinguishes its procedure from standard classification distillation by noting that it distills per-pixel restoration outputs rather than class probabilities and adds a contrastive term to pull the new output toward the old output while pushing it away from the degraded input (Cheng et al., 2024). A plausible implication is that weather-robust distillation methods tend to encode restoration or geometry priors directly in the supervisory object, rather than relying on softened categorical outputs.

3. Teacher construction, paired data, and pseudo-supervision

Because real paired labels under adverse weather are scarce, most methods devote substantial effort to constructing teacher signals or pseudo-pairs whose geometry remains valid. RoSe begins from clear-weather stereo datasets (DrivingStereo, MS2, and KITTI) and trains three CycleGAN-Turbo translators on unpaired samples. Each clear pair is converted into a pair in the target style, and because the translator changes appearance rather than geometry, the original ground-truth disparity is preserved. This yields three derived datasets: Adverse-DrivingStereo, with images in Clear/Fog/Rain/Night, Adverse-MS2, and Adverse-KITTI (Wang et al., 23 Sep 2025).

ACDepth similarly treats adverse weather generation as a prerequisite for robust distillation, but uses a one-step diffusion model built on Stable Diffusion Turbo with LoRA adapters. For each weather condition, a clear image and a prompt are processed to obtain the translated image. Cycle-consistency, adversarial, and identity regularization losses are jointly optimized so that the translated image preserves scene content while adopting target weather statistics. Separate LoRA sets are trained for day→night, day→rain, and related conditions on clear, rain, and night examples, with a separate set of examples for RobotCar night (Jiang et al., 18 May 2025).

In adverse weather optical flow and LiDAR detection, teacher generation is explicitly domain-adaptive. CH$^2$DA-Flow bridges clean, synthetic degraded, and real degraded domains. Synthetic fog and rain are used in Clean-Degraded Motion Adaptation, and the resulting synthetic-degraded encoder becomes the source model for Synthetic-Real Motion Adaptation on real degraded imagery (Zhou et al., 2024). SRKD augments sunny Waymo point clouds with DRET, a rain simulation pipeline that combines particle-based splashes in Unity3D with physically based LiDAR scattering and attenuation, producing rainy scans offline (Huang et al., 2024).

The oldest of the surveyed methods, “Rain O’er Me,” uses no external teacher at all. Instead, it creates supervision from the rainy image itself. A rainy image is downsampled, super-resolved by a pretrained SRDN, combined by element-wise minimum, and refined with a guided filter. The result is a blurred rain-free soft label. The network predicts a rain residual, enhances it, and adds the enhanced residual to a different clean image to form a new rainy image, thereby generating a hard rainy–clean pair for a second supervisory branch (Lin et al., 2019).

Continual adverse weather removal replaces synthetic pairing with replay. Its memory buffer stores only degraded old images, not clean targets, and at the end of each task a uniform random subset of newly seen data is added with equal budget per task. Distillation is then performed against the frozen old model on each replayed sample (Cheng et al., 2024). This suggests that in continual settings, the pseudo-labeling problem is shifted from “how to obtain a clean target” to “how to preserve the old model’s behavior on degraded inputs.”
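A replay buffer with an equal budget per task can be sketched as below. The fixed total capacity, the rule of shrinking earlier tasks when a new one arrives, and the class and method names are assumptions for illustration; the source only states that degraded images are stored and that each task receives an equal, uniformly sampled share.

```python
import random


class ReplayBuffer:
    """Stores degraded images only, with an equal share of slots per task."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.per_task = {}  # task name -> list of stored degraded samples
        self.rng = random.Random(seed)

    def add_task(self, task, degraded_samples):
        """Add a uniform random subset of the finished task and rebalance."""
        budget = self.capacity // (len(self.per_task) + 1)
        # Shrink earlier tasks so that every task holds at most `budget` items.
        for name, items in self.per_task.items():
            if len(items) > budget:
                self.per_task[name] = self.rng.sample(items, budget)
        k = min(budget, len(degraded_samples))
        self.per_task[task] = self.rng.sample(list(degraded_samples), k)

    def samples(self):
        """All stored degraded samples, across tasks."""
        return [s for items in self.per_task.values() for s in items]


# Hypothetical three-task sequence with ten degraded images per task.
buffer = ReplayBuffer(capacity=6)
for task in ("haze", "rain", "snow"):
    buffer.add_task(task, [f"{task}_{i}" for i in range(10)])
```

No clean targets are kept: at replay time each stored sample is passed through the frozen old model to produce the distillation target.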

4. Architectural couplings between distillation and weather robustness

Adverse weather distillation is rarely a stand-alone auxiliary loss; it is usually embedded in architectures designed to expose weather-invariant or degradation-sensitive representations. The continual all-in-one restoration framework adopts FFA-Net as a unified backbone, abstracts it into a feature extractor $F$ and image projector $\phi$, and adds an auxiliary auto-encoder pretrained on features of all stored old samples. Its encoder $\psi$ is then frozen and used as a PCA-style projector that compresses the extracted features, with implementation based on multi-head channel self-attention and a learnable channel selection layer so as to mimic PCA at linear instead of quadratic cost (Cheng et al., 2024).

RoSe couples distillation to a feature extractor enhanced with frozen visual foundation model priors. Outputs from pre-trained ViT blocks are fused with an FPN encoder at multiple scales, and an Anti-Adverse Feature Enhancement Module operates through instance normalization, batch normalization, channel-attention fusion, and Fourier-domain amplitude filtering. Its FFT/iFFT path filters degradation-related amplitude components while preserving phase and structure (Wang et al., 23 Sep 2025).
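Filtering amplitude while preserving phase can be demonstrated on a one-dimensional signal with a naive discrete Fourier transform. This is a toy sketch, not the AFEM module: the per-frequency `gains` are hypothetical and hand-set here, whereas a real module would learn or predict them on 2D feature maps.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def filter_amplitude(signal, gains):
    """Scale each frequency's amplitude by gains[k] while keeping its phase.

    For a real-valued result the gains should be symmetric,
    i.e. gains[k] == gains[n - k] for k >= 1.
    """
    filtered = []
    for coeff, g in zip(dft(signal), gains):
        amplitude, phase = cmath.polar(coeff)
        filtered.append(cmath.rect(amplitude * g, phase))
    return [v.real for v in idft(filtered)]


# Keeping only the zero-frequency amplitude leaves the signal mean.
smoothed = filter_amplitude([1.0, 2.0, 3.0, 4.0], gains=[1.0, 0.0, 0.0, 0.0])
```

Because only the magnitudes are rescaled, the spatial layout encoded in the phase is left untouched, which is the property the text attributes to the FFT/iFFT path.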

In CLIP-based adverse weather removal, the Spatially-Adaptive Residual Encoder and CLIP Weather Prior module make distillation explicitly architectural. Each SAR Transformer block applies multi-head self-attention followed by a SARFFN. The SAR module predicts location-varying combinations of basis depthwise kernels, so the residual branch concentrates on degraded regions. The CWP module then injects sample-specific CLIP priors and distribution-specific embeddings through cross-attention, with a side cross-entropy weather-classification loss to regularize the prior (Tan et al., 2023).

CH$^2$DA-Flow and SRKD exhibit analogous couplings in motion and detection systems. CH$^2$DA-Flow uses a FlowFormer backbone with cost-volume correlations, a DispNet stereo network, and PoseNet, so that synthetic→real distillation operates directly on motion statistics in the cost space (Zhou et al., 2024). SRKD uses identical teacher and student detector architectures, but ties distillation to object-level similarity measures, high-confidence response selection, and explicit noise ratios inside predicted boxes (Huang et al., 2024). A plausible implication is that adverse weather distillation is most effective when the intermediate representation chosen for transfer corresponds to the physical locus of the weather corruption: image residuals for restoration, disparities for stereo, cost volumes for flow, and point-density responses for LiDAR detection.

5. Quantitative behavior across representative systems

The strongest quantitative evidence in the surveyed literature comes from settings where weaker replay or weaker self-supervision is directly compared to stronger distillation. In continual all-in-one adverse weather removal on the three-task sequence Haze → Rain → Snow, Joint-M, which uses replay without distillation, gives the reference average PSNR; adding only prediction KD raises it, and adding principal-feature KD raises it further, to slightly below the “oracle” Individual-task upper bound. On the two-task sequence Haze → Rain, the full model is evaluated against AFC and Individual-task training. The same study reports stable results down to small exemplar budgets, only a slight drop in first-task performance after a second task, and the share of its original PSNR that each earlier task retains after the third task (Cheng et al., 2024).
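For reference, PSNR and the retention figure used in such continual comparisons can be computed as follows. The standard PSNR definition is used; `retained_fraction` and the sample values are hypothetical and do not reproduce any number from the cited study.

```python
import math


def psnr(restored, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat images."""
    mse = sum((r - g) ** 2 for r, g in zip(restored, reference)) / len(reference)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_value ** 2 / mse)


def retained_fraction(psnr_after, psnr_before):
    """Share of a task's original PSNR that survives later training stages."""
    return psnr_after / psnr_before


# Made-up pixel values in [0, 1].
score = psnr([0.5, 0.5], [0.4, 0.6])
```

Forgetting is then read off by evaluating an earlier task's test set after each later stage and comparing against its PSNR immediately after it was learned.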

RoSe reports clear gains from its Step 2 adverse-weather distillation. On DrivingStereo weather validation, the mixed-training Step 2 model is evaluated by EPE and Bad-3.0 under Clear, Fog, Rain, and Night. Zero-shot generalization is measured by Bad-3.0 on DrivingStereo Clear/Fog/Rain and on MS2 Night. Ablations attribute a further relative improvement to Step 2 distillation after prior gains from VFM fusion, scene-correspondence losses, and AFEM (Wang et al., 23 Sep 2025).
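The two stereo metrics named here follow widely used definitions: EPE is the mean absolute disparity error, and Bad-3.0 is the percentage of pixels whose error exceeds 3 pixels. The sketch below assumes all pixels are valid; individual benchmarks may mask invalid pixels or add further criteria.

```python
def epe(pred, gt):
    """End-point error: mean absolute disparity error in pixels."""
    errors = [abs(p - g) for p, g in zip(pred, gt)]
    return sum(errors) / len(errors)


def bad_ratio(pred, gt, threshold=3.0):
    """Percentage of pixels whose absolute disparity error exceeds threshold."""
    errors = [abs(p - g) for p, g in zip(pred, gt)]
    return 100.0 * sum(e > threshold for e in errors) / len(errors)


# Made-up disparities for four pixels.
pred = [10.0, 12.0, 20.0, 30.0]
gt = [10.0, 10.0, 15.0, 30.0]
mean_error = epe(pred, gt)
bad3 = bad_ratio(pred, gt)
```

EPE summarizes average accuracy, while Bad-3.0 is more sensitive to the gross outliers that fog, rain, and night tend to produce.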

ACDepth reports improvements in monocular depth under rain and night on nuScenes. Against md4all-DD, absRel decreases under both rain and night. Its ablations on nuScenes night start from a baseline without distillation and add the distillation terms one at a time, reporting absRel after each addition; the same sequence is reported for rain (Jiang et al., 18 May 2025).
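The absRel metric and the relative reduction used to compare two methods can be written out explicitly. The definition of absRel is the standard one; the function names and the numbers in the example are hypothetical and are not results from the paper.

```python
def abs_rel(pred_depth, gt_depth):
    """Mean of |prediction - ground truth| / ground truth over valid pixels."""
    ratios = [abs(p - g) / g for p, g in zip(pred_depth, gt_depth)]
    return sum(ratios) / len(ratios)


def relative_reduction(baseline_error, new_error):
    """Percentage by which an error metric drops relative to a baseline."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Made-up depths in metres for two pixels.
error = abs_rel([2.0, 4.0], [1.0, 4.0])
gain = relative_reduction(0.20, 0.15)
```

Because absRel divides by the true depth, errors on nearby objects weigh more heavily than equal absolute errors far away.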

For weather removal, CLIP-based residual distillation also shows measurable gains. On Snow100K-L, Test1, and RainDrop, the full method is compared with TransWeather on mean PSNR and SSIM. In ablations, the baseline PSNR improves with CutMix, with CWP, and with SAR plus CLIP-SRD. On SPA-rain without retraining, it is again compared with TransWeather (Tan et al., 2023).

Motion and detection studies report similarly consistent trends. CH$^2$DA-Flow is compared by EPE with the best prior method on Weather-GOF rain and on DenseFog, and by EPE and F1-all with direct baselines on Real-Weather World (Zhou et al., 2024). In LiDAR 3D detection on WOD-DA, DSVT, Voxel-RCNN, and PV-RCNN++ all improve in All(L2) mAP/mAPH with DRET-Aug plus SRKD. The same framework also slightly improves sunny All(L2-mAP), for example for DSVT (Huang et al., 2024).

“Rain O’er Me” does not report paired PSNR/SSIM on real scenes because no paired real ground truth exists, but emphasizes model compactness and speed, reporting its parameter count and its CPU and GPU inference times at a fixed image size (Lin et al., 2019).

6. Conceptual distinctions, limitations, and open issues

Several conceptual distinctions recur across the literature. First, adverse weather distillation is not equivalent to domain adaptation, although the two are frequently combined. CH$^2$DA-Flow explicitly frames its method as cumulative homogeneous-heterogeneous adaptation, yet the synthetic→real transfer stage is implemented through knowledge distillation on cost-volume histograms and pseudo-flow labels (Zhou et al., 2024). SRKD likewise relies on rainy augmentation and detector supervision, but its key cross-weather transfer mechanism is teacher–student alignment between sunny and rainy scans (Huang et al., 2024).

Second, the “teacher” need not be an externally supervised large model. It may be a frozen previous checkpoint in continual learning, a clear-weather model in cross-weather stereo or depth, a synthetic-domain encoder in optical flow, a pretrained multimodal encoder such as CLIP, or a soft label obtained by filtering the rainy image itself (Cheng et al., 2024, Wang et al., 23 Sep 2025, Jiang et al., 18 May 2025, Tan et al., 2023, Lin et al., 2019). This suggests that adverse weather distillation is better understood as a transfer principle—moving supervision into a representation or domain where it is more stable—than as a fixed architecture.

Third, the surveyed results also delimit the method’s current boundaries. RoSe’s night results remain worse than its clear-weather results, with Bad-3.0 higher on Night than on Clear even after Step 2 distillation (Wang et al., 23 Sep 2025). ACDepth’s ablation on nuScenes rain shows that adding the ordinal guidance term does not move absRel monotonically, so the benefit of individual submodules can be non-monotonic until the full objective is assembled (Jiang et al., 18 May 2025). The CLIP-based restoration paper identifies a key challenge as reconciling CLIP’s global alignment objective with pixel-level restoration, and therefore delays the distillation loss until a later training epoch (Tan et al., 2023). “Rain O’er Me” notes that extremely heavy rain or rain plus mist may still leave residual fog, and that the soft-label path can over-smooth textures (Lin et al., 2019). SRKD reports no inference-time penalty, but does incur extra GPU training time and VRAM, while DRET is not end-to-end and requires offline preprocessing (Huang et al., 2024).

A final misconception is that distillation under adverse weather always requires task IDs or specialized branches. The continual all-in-one weather removal framework explicitly states that no task ID or specialized branch is needed at test time, because haze, rain, snow, or mixtures are processed by the same $F$–$\phi$ pipeline (Cheng et al., 2024). Comparable unification is visible in RoSe’s mixed-input student and in ACDepth’s mixed clear plus synthetic degradation training (Wang et al., 23 Sep 2025, Jiang et al., 18 May 2025). A plausible implication is that, as the field matures, the most durable use of adverse weather distillation may be not as a bolt-on regularizer but as a way to train unified backbones whose intermediate representations are anchored by clean-scene priors even when the observable input is severely degraded.