Oriented Contrastive Denoising Overview
- Oriented contrastive denoising is a paradigm that reframes denoising as a contrastive alignment task using explicit corruption signals.
- It leverages structured orientation cues—such as masking tokens, noise levels, anatomical consistency, and temporal adjacency—to guide positive and negative pairing.
- The cited works report improvements in video-language pre-training, diffusion model robustness, and low-dose CT imaging.
Oriented contrastive denoising denotes a class of objectives in which denoising is not treated solely as direct reconstruction, but as a contrastive alignment problem whose positives and negatives are explicitly directed by a known source of corruption or correspondence. In the literature covered here, that direction is supplied by artificial masking in video–language pre-training, by relative noise level in diffusion models, by anatomy-aware semantic correspondence in low-dose CT, or by adjacency along a probability-flow trajectory. Taken together, these works suggest that the defining property of the approach is not a single architecture, but the use of a structured orientation signal that specifies which noisy and clean states should be pulled together and which mismatched states should be pushed apart (Luo et al., 2021, Wu et al., 2024, Wang et al., 11 Aug 2025, Lei et al., 22 Jan 2025).
1. General formulation and defining characteristics
A compact comparison of the main formulations is given below.
| Work | Orientation source | Contrastive structure |
|---|---|---|
| CoCo-BERT | artificial [MASK] tokens; true video–sentence pairing | masked query matched to paired unmasked cross-modal key and to its own unmasked intra-modal key |
| Contrastive Diffusion Training | different amounts of noise; OOD pair | binary classification of which of two noise levels produced a noisy sample |
| ALDEN | same anatomy; tissue-specific semantics | positive same-coordinate denoised/NDCT pair; negatives from same-coordinate LDCT and cross-location NDCT |
| rRCM | adjacent time-steps on the same PF-ODE trajectory with the same Gaussian noise | positive pair of adjacent-time-step points from one trajectory; negatives from other samples in the batch |
The shared pattern is that the denoising target is oriented by a relation that is stronger than generic augmentation invariance. In CoCo-BERT, the relevant corruption is the masking procedure itself. In contrastive diffusion training, the orientation is between noisy marginals with different log-SNR values and the associated OOD failure mode. In ALDEN, the alignment relation is anatomical consistency at matched spatial coordinates, supplemented by negatives designed to suppress residual noise and anatomical misplacement. In rRCM, orientation is temporal and dynamical: the positive pair is restricted to adjacent points on the same diffusion trajectory.
This suggests that oriented contrastive denoising is best understood as a design pattern in which denoising supervision is anchored to a known corruption process or semantic relation. The denoising signal can therefore be expressed in embedding space, in cross-modal representation space, or through a classifier-like objective over noise levels, rather than only through pixel-wise regression.
2. Masking-oriented cross-modal denoising in CoCo-BERT
CoCo-BERT introduces “Contrastive Cross-modal matching and denoising” (CoCo) for video–language pre-training. The proxy objective adds a single unified loss to the standard masked-language modeling and masked-sequence-generation objectives. It has two parts: Inter-modal Contrastive Matching (Co-IM), which encourages a masked video or sentence query to match its paired unmasked sentence or video key and to be distinct from cross-modal negatives, and Intra-modal Contrastive Denoising (Co-ID), which encourages a masked video or sentence query to align with its own unmasked video or sentence key and to be distinct from same-modality negatives (Luo et al., 2021).
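For one video–sentence pair in a mini-batch, masked inputs are encoded as queries and unmasked inputs as keys. After projection by a small MLP plus attention, the model obtains query and key embeddings for both modalities and maintains two cross-batch memory banks of negatives, one per modality. Using cosine similarity and a temperature, Co-IM and Co-ID are each a contrastive loss that scores a masked query against its positive key and the memory-bank negatives: the paired cross-modal key and cross-modal negatives for Co-IM, and the query's own unmasked key and same-modality negatives for Co-ID.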
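The two contrastive terms described above can be sketched in NumPy as follows. This is a minimal illustration that assumes an InfoNCE-style form with cosine similarity; the function names (`info_nce`, `coco_losses`), the symmetric sum over the video and sentence directions, and any temperature passed in are assumptions made for the sketch, not details confirmed by the source.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length (for cosine similarity)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(query, positive_key, negative_keys, temperature):
    """Negative log-softmax score of the positive key among [positive, negatives]."""
    q = l2_normalize(query)              # (D,)
    k_pos = l2_normalize(positive_key)   # (D,)
    k_neg = l2_normalize(negative_keys)  # (K, D)
    logits = np.concatenate(([q @ k_pos], k_neg @ q)) / temperature
    logits = logits - logits.max()       # numerical stability
    return float(np.log(np.exp(logits).sum()) - logits[0])

def coco_losses(q_video, q_sent, k_video, k_sent, bank_video, bank_sent, temperature):
    """Return (Co-IM, Co-ID) for one video-sentence pair.

    q_* are masked queries, k_* are unmasked keys, bank_* are memory-bank
    negatives of shape (K, D). Summing both directions is an assumption.
    """
    # Co-IM: masked query vs. paired unmasked key of the OTHER modality.
    co_im = (info_nce(q_video, k_sent, bank_sent, temperature)
             + info_nce(q_sent, k_video, bank_video, temperature))
    # Co-ID: masked query vs. its OWN unmasked key in the same modality.
    co_id = (info_nce(q_video, k_video, bank_video, temperature)
             + info_nce(q_sent, k_sent, bank_sent, temperature))
    return co_im, co_id
```

The only difference between the two terms in this sketch is which key is treated as the positive and which memory bank supplies the negatives, which is the sense in which the masking process orients the objective.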
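The architecture uses two query encoders and two key encoders, each a standard Transformer stack with 6 layers. The key encoders have identical architectures but are updated by momentum from the corresponding query-encoder parameters $\theta_q$,
$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q,$$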
with 2. A cross-modal decoder with 6 Transformer blocks sits on top of the query outputs to perform MLM and MSG. Two FIFO memories of size 3 per modality store recent positive keys as negatives.
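The momentum-updated key encoders and FIFO memories can be sketched as below. This is an illustrative sketch assuming the usual exponential-moving-average update; `momentum_update` and `FifoMemory` are hypothetical names, parameters are modeled as a plain dictionary of arrays, and no particular momentum value or memory size is implied.

```python
from collections import deque

import numpy as np

def momentum_update(key_params, query_params, m):
    """Update key-encoder parameters in place as an EMA of the query encoder.

    theta_k <- m * theta_k + (1 - m) * theta_q; no gradient reaches theta_k.
    """
    for name, theta_q in query_params.items():
        key_params[name] = m * key_params[name] + (1.0 - m) * theta_q

class FifoMemory:
    """Fixed-size first-in-first-out store of recent keys, reused as negatives."""

    def __init__(self, size):
        self._keys = deque(maxlen=size)  # oldest entries are evicted first

    def enqueue(self, keys):
        for key in keys:
            self._keys.append(np.asarray(key, dtype=float))

    def negatives(self):
        """Return stored keys as a (K, D) array, or an empty array if none."""
        return np.stack(self._keys) if self._keys else np.empty((0, 0))
```

In a training loop following this pattern, each step would enqueue the current batch of unmasked keys after computing the loss, so that positives from recent batches serve as negatives for later ones.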
The training pipeline masks 15% of frame positions and 15% of word tokens to form the queries while keeping unmasked originals as keys. Query encoders produce 4 and 5, key encoders produce 6 and 7, negative sets are built from memory, and the model computes 8, 9, 0, and 1. Gradients are back-propagated through the query encoders and decoder; the key encoders are updated only by momentum. The reported settings are memory bank size 2 per modality, temperature 3, batch size 4, learning rate 5, up to 6 pre-train epochs, and frame sampling of up to 7 frames on TV and up to 8 frames on ACTION, with frame features from ResNet-152 9 0 SlowFast 1.
The paper explicitly explains why this is an “oriented” form of contrastive denoising. The only noise injected is from the artificial [MASK] tokens, and CoCo “orients” denoising specifically against that noise by contrasting each masked representation with its exact unmasked counterpart. At the same time, cross-modal matching is oriented toward the true pairing: each masked video must match its true unmasked sentence rather than any other sentence, and vice versa. CoCo-BERT was pre-trained on TV and ACTION and was evaluated on cross-modal retrieval, video question answering, and video captioning, where the authors report superior results for it as a pre-trained model.
3. Noise-level discrimination and OOD denoising in diffusion models
A related formulation appears in contrastive diffusion training, which begins from the claim that diffusion models implicitly define a log-likelihood ratio between noisy marginals and can therefore be interpreted as hidden noise-level classifiers. The noisy family is written as
2
with log-SNR parameter 3. For two noise levels 4, the paper defines a log-likelihood ratio 5 and emphasizes that standard diffusion training only observes 6 on 7, whereas evaluations at mismatched noise levels lie outside the training distribution and degrade denoiser quality (Wu et al., 2024).
The proposed self-supervised contrastive diffusion loss (CDL) turns the implicit classifier for noise levels into a training signal. A binary task is defined with 8 if 9 and 0 if 1, sampled with equal probability. The loss is
2
By inserting the density-in-terms-of-denoiser form, the objective becomes a contrastive MSE-based loss on pairs at two noise levels. In the reported implementation, each training step samples a real image 3, samples 4, flips a fair coin 5, computes approximate log densities via MSE proxies at SNR 6 and at induced noise level 7, and backpropagates 8. The paper characterizes this as “filling in” supervision on the OOD pair 9.
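The idea of using a log-likelihood ratio between noisy marginals as a noise-level classifier can be illustrated on a toy problem. The sketch below is not the paper's CDL: it assumes standard-normal data and an additive (variance-exploding) noise model so that the noisy marginals have closed-form log-densities, whereas the source obtains approximate log-densities from the denoiser through MSE proxies. All names (`log_marginal`, `noise_level_bce`, `sample_task`) are hypothetical.

```python
import numpy as np

def log_marginal(z, sigma, data_std=1.0):
    """Log-density of z = x + sigma * eps when x ~ N(0, data_std^2 I) (toy assumption)."""
    d = z.shape[-1]
    var = data_std**2 + sigma**2
    return -0.5 * d * np.log(2.0 * np.pi * var) - 0.5 * np.sum(z**2, axis=-1) / var

def noise_level_bce(z, y, sigma_a, sigma_b):
    """Binary cross-entropy whose logit is the log-likelihood ratio of two noise levels.

    y = 1 means z was drawn at sigma_a, y = 0 means it was drawn at sigma_b.
    """
    logit = log_marginal(z, sigma_a) - log_marginal(z, sigma_b)
    log_p1 = -np.logaddexp(0.0, -logit)  # log sigmoid(logit)
    log_p0 = -np.logaddexp(0.0, logit)   # log sigmoid(-logit)
    return float(-np.mean(y * log_p1 + (1 - y) * log_p0))

def sample_task(rng, n, d, sigma_a, sigma_b):
    """Draw n noisy points, choosing each noise level with a fair coin."""
    x = rng.normal(size=(n, d))
    y = rng.integers(0, 2, size=n)
    sigma = np.where(y == 1, sigma_a, sigma_b)[:, None]
    return x + sigma * rng.normal(size=(n, d)), y
```

With exact log-densities the loss falls below the chance level of log 2, and swapping the two noise levels in the classifier raises it. In the learned setting, the same signal would instead penalize a denoiser whose implied densities misrank the two noise levels.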
The empirical results target both sequential and parallel sampling. On a parallel sampler with 0 samples, DDPM on CIFAR-10 1 improves from FID 2 to 3, VP from 4 to 5, VE from 6 to 7, FFHQ 8 VP from 9 to 0, and FFHQ 1 VE from 2 to 3. On the sequential deterministic EDM sampler with 4 samples, the gains are modest but consistent, for example VP on FFHQ improves from 5 to 6 at NFE 7, and VE on FFHQ improves from 8 to 9 at NFE 0. In the 2D Dino synthetic example at target MMD 1, the number of Picard iterations drops from 2 to 3, NFE from 4 to 5, and wall-time from 6 to 7.
Within the broader topic, this work generalizes contrastive denoising beyond paired clean/noisy reconstructions. The orientation is supplied by noise level itself and by the specific OOD discrepancy created when the denoiser is queried off the standard forward path.
4. Anatomy-aware semantic contrastive denoising in low-dose CT
ALDEN formulates an anatomy-aware low-dose CT denoising pipeline built on a GAN backbone with two additional components: an Anatomy-Aware Discriminator (AAD) and a Semantic-Guided Contrastive Learning (SCL) module. The generator adopts ESAU-Net and maps a low-dose CT slice 8 to 9 under pixel-wise supervision
0
where 1 is the paired normal-dose CT. The discriminator is conditioned on hierarchical semantic features extracted from the reference NDCT by a fixed pretrained vision model 2, for example DINOv2 or MedSAM. Three levels of embeddings, 3, are taken from transformer layers 4, 5, and 6, while the discriminator feature maps are denoted 7. At each level, an Attention-based Feature Fusion module cross-attends semantic priors and discriminator features to form anatomy-aware features 8 (Wang et al., 11 Aug 2025).
The adversarial game is
9
The contrastive component acts on PVM feature embeddings. For each batch of size 0, the fixed PVM extracts feature tensors from the LDCT input 1, the denoised output 2, and the reference NDCT 3. At randomly sampled spatial coordinates, ALDEN constructs one positive set and two negative sets: same-location denoised/NDCT pairs preserve structure, same-location denoised/LDCT pairs penalize residual noise, and different-location denoised/NDCT pairs penalize anatomical misalignment. The InfoNCE-style objective is
4
with cosine similarities and 5.
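The positive set and the two negative sets described above can be sketched as a single InfoNCE-style loss over sampled coordinates. This is a minimal sketch under stated assumptions: feature maps are arrays of shape (C, H, W) from a fixed feature extractor, coordinates are sampled without replacement, and the function names and the temperature default are placeholders rather than the paper's settings.

```python
import numpy as np

def sample_coords(rng, height, width, n):
    """Sample n distinct (row, col) locations on an H x W feature map."""
    flat = rng.choice(height * width, size=n, replace=False)
    return np.stack(np.unravel_index(flat, (height, width)), axis=1)

def _unit(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def semantic_contrastive_loss(feat_ldct, feat_denoised, feat_ndct, coords,
                              temperature=0.1):
    """Anchor: denoised feature at p. Positive: NDCT at p.
    Negatives: LDCT at p (residual noise) and NDCT at q != p (misplaced anatomy)."""
    rows, cols = coords[:, 0], coords[:, 1]
    anchor = _unit(feat_denoised[:, rows, cols].T)  # (N, C)
    ndct = _unit(feat_ndct[:, rows, cols].T)        # (N, C)
    ldct = _unit(feat_ldct[:, rows, cols].T)        # (N, C)

    sim_pos = np.sum(anchor * ndct, axis=-1)        # same location, NDCT
    sim_noise = np.sum(anchor * ldct, axis=-1)      # same location, LDCT
    n = len(coords)
    cross = (anchor @ ndct.T)[~np.eye(n, dtype=bool)].reshape(n, n - 1)

    logits = np.concatenate([sim_pos[:, None], sim_noise[:, None], cross], axis=1)
    logits = logits / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob_pos.mean())
```

In this sketch, an output whose features match the NDCT reference at each location scores a lower loss than one that still resembles the LDCT input, which reflects the role of the same-location LDCT negatives.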
The total objective is
6
with 7 and 8. Training uses Adam with 9, 00, learning rate 01, batch size 02, and 03 iterations.
The reported quantitative results are explicit. On Mayo2016, ALDEN-DINOv2 achieves PSNR 04 dB, SSIM 05, RMSE 06, and LPIPS 07. On the in-house MCTD dataset, ALDEN-DINOv2 leads in SSIM 08, RMSE 09, and LPIPS 10. The Mayo2016 ablation study reports the following progression: the baseline ESAU-Net+GAN yields PSNR 11, SSIM 12, RMSE 13, LPIPS 14; adding AAD only gives 15, 16, 17, 18; adding SCL only gives 19, 20, 21, 22; and ALDEN with both components gives 23, 24, 25, 26. On the downstream multi-organ segmentation task with TotalSegmentator’s test set of 27 CTs and 28 organs, the Dice score is 29 in the low-noise scenario and 30 in the high-noise scenario, the latter reported as best with 31 over the next best method.
In this formulation, orientation is semantic and spatial. Positive pairing is restricted to matched anatomy at the same coordinates, while the dual negatives explicitly target residual LDCT noise and cross-location anatomical mismatch. The paper presents this as a way to preserve tissue-specific patterns without requiring manual segmentation labels.
5. Trajectory-oriented latent denoising in robust representation consistency models
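rRCM reformulates denoising along diffusion trajectories as a discriminative latent-space problem connected to randomized smoothing. The underlying forward SDE, written in its general form, is
$$\mathrm{d}x_t = f(x_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w_t,$$
with probability-flow ODE
$$\mathrm{d}x_t = \left[f(x_t, t) - \tfrac{1}{2}\,g(t)^2\,\nabla_{x}\log p_t(x_t)\right]\mathrm{d}t,$$
where $f$ is the drift, $g$ the diffusion coefficient, $w_t$ a standard Wiener process, and $p_t$ the marginal density at time $t$.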
After discretization at times 34, noisy points satisfy 35, and rRCM uses instance discrimination to align temporally adjacent points along the same trajectory (Lei et al., 22 Jan 2025).
The encoder is a Vision Transformer with time embedding, denoted 36, followed by a linear head 37 for logits and, during pre-training only, a 3-layer MLP projector 38. With normalized embeddings, the oriented consistency term is
39
Here the positive pair is 40, where both points use the same Gaussian noise 41 and differ only by adjacent time-step. In parallel, the model applies a standard augmentation contrastive loss using two augmented views of the clean image. The joint pre-training objective minimizes the sum of the consistency and augmentation contrastive terms.
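The trajectory-oriented pairing can be sketched as follows. This is an illustration only: it assumes an additive perturbation `x + t * eps` for the noisy points, uses a small random `tanh` projection as a hypothetical stand-in for the time-conditioned ViT and projector, and takes negatives from the other samples in the batch. The names and the temperature used in any call are assumptions.

```python
import numpy as np

def unit(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def adjacent_noisy_pair(x, eps_noise, t_earlier, t_later):
    """Two points at adjacent time-steps that share the SAME noise draw."""
    return x + t_earlier * eps_noise, x + t_later * eps_noise

def toy_encoder(x_noisy, t, weight):
    """Hypothetical stand-in for a time-conditioned encoder plus projector."""
    return np.tanh((x_noisy / np.sqrt(1.0 + t**2)) @ weight)

def consistency_loss(z_later, z_earlier, temperature):
    """InfoNCE over the batch: row i's positive is column i (same trajectory);
    all other columns are negatives from other samples in the batch."""
    logits = unit(z_later) @ unit(z_earlier).T / temperature  # (B, B)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

Replacing the shared noise with an independent draw for one of the two points moves the pair apart in input space, which is one way to see why the source treats shared noise and adjacency as part of the orientation.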
The paper defines orientation very narrowly: because the PF-ODE yields a unique continuous trajectory for each clean image, rRCM draws positives only among adjacent time-steps on the same trajectory. The ablations state that pairing points with different 42 or pairing non-adjacent time-steps breaks that orientation and yields inferior alignment. Model sizes reported for ImageNet are rRCM-S with 43M parameters, rRCM-B with 44M, and rRCM-B-Deep with 45M. The temperature is 46 for both consistency and augmentation losses. Pre-training uses ImageNet for 47k steps with batch size 48 and CIFAR-10 for 49k steps with batch size 50, using AdamW with learning rate 51; fine-tuning uses 52 epochs on ImageNet and 53 epochs on CIFAR-10.
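For randomized smoothing, once the classifier $f$ is fixed, the smoothed classifier is, in the standard form,
$$g(x) = \arg\max_{c}\; \mathbb{P}_{\delta \sim \mathcal{N}(0, \sigma^2 I)}\big[f(x + \delta) = c\big],$$
and the certified $\ell_2$ radius is
$$R = \frac{\sigma}{2}\left(\Phi^{-1}(\underline{p_A}) - \Phi^{-1}(\overline{p_B})\right),$$
where $\Phi^{-1}$ is the inverse standard normal CDF, $\underline{p_A}$ is a lower bound on the top-class probability under noise, and $\overline{p_B}$ is an upper bound on the runner-up probability.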
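A Monte Carlo version of randomized-smoothing certification can be sketched as below. It assumes the standard Gaussian-smoothing certificate with the runner-up bound taken as one minus the top-class lower bound, which gives a radius of sigma times the inverse normal CDF of that lower bound. Two simplifications are assumptions of this sketch: a Hoeffding bound replaces the Clopper-Pearson interval commonly used in practice, and the same samples are used for both selecting and estimating the top class. `certify` and `base_classifier` are hypothetical names.

```python
import math
from statistics import NormalDist

import numpy as np

def certify(base_classifier, x, sigma, n_samples, alpha, num_classes, rng):
    """Return (predicted_class, certified_l2_radius), or (None, 0.0) to abstain."""
    noise = rng.normal(scale=sigma, size=(n_samples,) + x.shape)
    labels = [int(base_classifier(x + delta)) for delta in noise]
    votes = np.bincount(labels, minlength=num_classes)
    top = int(votes.argmax())

    # One-sided Hoeffding lower bound on the top-class probability,
    # valid with probability at least 1 - alpha (simplifying assumption).
    p_hat = votes[top] / n_samples
    p_lower = p_hat - math.sqrt(math.log(1.0 / alpha) / (2.0 * n_samples))
    if p_lower <= 0.5:
        return None, 0.0  # not confident enough to certify

    # With p_B <= 1 - p_lower, the radius reduces to sigma * Phi^{-1}(p_lower).
    return top, sigma * NormalDist().inv_cdf(p_lower)
```

For a linear toy classifier such as `lambda v: int(v[0] > 0)` evaluated at a point two units from the decision boundary, the returned radius stays below the true distance of 2, as a sound certificate should.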
The paper reports that the method outperforms the certified accuracy of diffusion-based methods on ImageNet across all perturbation radii by 56 on average, with up to 57 at larger radii, while reducing inference costs by 58 on average. On ImageNet, rRCM-B reports certified accuracies 59 at radii 60, with latency 61 s / 62 s†, while rRCM-B-Deep reports 63 with latency 64 m 65 s. On CIFAR-10, rRCM-B reports latency 66 s and certified accuracies 67 at radii 68.
This work places oriented contrastive denoising in a robustness setting rather than a generative one. The denoising effect is implicit: instead of explicitly reconstructing a clean sample, the model learns representation consistency as one moves backward along the diffusion path, enabling one-shot denoising-and-classification.
6. Conceptual scope, common misconceptions, and open directions
The literature does not present a single universal objective for oriented contrastive denoising. Instead, each formulation instantiates orientation differently: masking noise and true cross-modal pairing in CoCo-BERT, noise-level discrimination and OOD correction in contrastive diffusion training, anatomy-aware spatial semantics in ALDEN, and adjacent same-trajectory consistency in rRCM (Luo et al., 2021, Wu et al., 2024, Wang et al., 11 Aug 2025, Lei et al., 22 Jan 2025).
A common misconception is that contrastive denoising is equivalent to generic augmentation-based self-supervision. The cited formulations do not support that interpretation. Their positives are exact unmasked counterparts, clean-versus-extra-noisy samples, same-location anatomical matches, or adjacent points on the same diffusion trajectory. Their negatives are likewise structured: FIFO memory-bank negatives from other videos or sentences, mismatched noise levels, same-location LDCT residuals, cross-location anatomy mismatches, or other samples in the batch. This suggests that the critical ingredient is not contrastive learning in the abstract, but the specification of a domain-valid orientation relation.
A second misconception is that denoising must be expressed only in pixel space. CoCo-BERT performs denoising at the sequence-representation level while coupling it to cross-modal matching. ALDEN applies contrastive loss to pretrained semantic feature embeddings and couples it to an anatomy-aware discriminator. rRCM operates on latent representations of a time-conditioned ViT. Contrastive diffusion training derives a classifier-like objective from log-likelihood ratios between noisy marginals. The underlying denoising mechanism is therefore representation-level in several of the reported systems.
The open issues named in the literature are also domain-specific. ALDEN identifies dependence on PVM domain gap, compute overhead from cross-attention in the discriminator and contrastive sampling, the possibility that random negative sampling may miss rare tissue patterns, the extension from 2D slices to full 3D volumes, and joint finetuning of the PVM as future directions (Wang et al., 11 Aug 2025). Contrastive diffusion training isolates denoiser degradation in regions far outside the training distribution as a core sampling problem, especially for parallel sampling (Wu et al., 2024). rRCM shows that mis-specified orientation—non-adjacent time-steps or different noise realizations—reduces performance (Lei et al., 22 Jan 2025). CoCo-BERT begins from the argument that masked inputs “would inevitably introduce noise for cross-modal matching proxy task,” motivating a denoising objective specifically oriented to that masking process (Luo et al., 2021).
Across these works, oriented contrastive denoising emerges as a framework for turning known structure in corruption, pairing, or dynamics into a discriminative training signal. The orientation signal determines what counts as faithful recovery, whether that recovery is defined as true video–sentence alignment, low-OOD denoiser behavior, anatomical consistency, or invariant representation along a diffusion trajectory.