Oriented Contrastive Denoising Overview

Updated 4 July 2026

Oriented contrastive denoising is a paradigm that reframes denoising as a contrastive alignment task using explicit corruption signals.
It leverages structured orientation cues—such as masking tokens, noise levels, anatomical consistency, and temporal adjacency—to guide positive and negative pairing.
This approach has demonstrated tangible improvements in video-language pre-training, diffusion model robustness, and low-dose CT imaging.

Oriented contrastive denoising denotes a class of objectives in which denoising is not treated solely as direct reconstruction, but as a contrastive alignment problem whose positives and negatives are explicitly directed by a known source of corruption or correspondence. In the literature covered here, that direction is supplied by artificial masking in video–language pre-training, by relative noise level in diffusion models, by anatomy-aware semantic correspondence in low-dose CT, or by adjacency along a probability-flow trajectory. Taken together, these works suggest that the defining property of the approach is not a single architecture, but the use of a structured orientation signal that specifies which noisy and clean states should be pulled together and which mismatched states should be pushed apart (Luo et al., 2021; Wu et al., 2024; Wang et al., 11 Aug 2025; Lei et al., 22 Jan 2025).

1. General formulation and defining characteristics

A compact comparison of the main formulations is given below.

Work	Orientation source	Contrastive structure
CoCo-BERT	artificial `[MASK]` tokens; true video–sentence pairing	masked query matched to paired unmasked cross-modal key and to its own unmasked intra-modal key
Contrastive Diffusion Training	different amounts of noise; OOD pair $(z_\zeta,\beta)$	binary classification between $x\sim p_0$ and $x\sim p_\zeta$
ALDEN	same anatomy; tissue-specific semantics	positive same-coordinate denoised/NDCT pair; negatives from same-coordinate LDCT and cross-location NDCT
rRCM	adjacent time-steps on the same PF-ODE trajectory with the same $\epsilon$	positive pair $(x_{t_n}^i,x_{t_{n-1}}^i)$; negatives from other samples in the batch

The shared pattern is that the denoising target is oriented by a relation that is stronger than generic augmentation invariance. In CoCo-BERT, the relevant corruption is the masking procedure itself. In contrastive diffusion training, the orientation is between noisy marginals with different log-SNR values and the associated OOD failure mode. In ALDEN, the alignment relation is anatomical consistency at matched spatial coordinates, supplemented by negatives designed to suppress residual noise and anatomical misplacement. In rRCM, orientation is temporal and dynamical: the positive pair is restricted to adjacent points on the same diffusion trajectory.

This suggests that oriented contrastive denoising is best understood as a design pattern in which denoising supervision is anchored to a known corruption process or semantic relation. The denoising signal can therefore be expressed in embedding space, in cross-modal representation space, or through a classifier-like objective over noise levels, rather than only through pixel-wise regression.

2. Masking-oriented cross-modal denoising in video–language pre-training

CoCo-BERT introduces “Contrastive Cross-modal matching and denoising” (CoCo) for video–language pre-training. The proxy objective adds a single unified loss to the standard masked-language modeling and masked-sequence-generation objectives. It has two parts: Inter-modal Contrastive Matching (Co-IM), which encourages a masked video or sentence query to match its paired unmasked sentence or video key and to be distinct from cross-modal negatives, and Intra-modal Contrastive Denoising (Co-ID), which encourages a masked video or sentence query to align with its own unmasked video or sentence key and to be distinct from same-modality negatives (Luo et al., 2021).

For one video–sentence pair in a mini-batch, masked inputs are encoded as queries and unmasked inputs as keys. After projection by a small MLP plus attention, the model obtains $q_V^{(m)}, q_S^{(m)}, k_V^+, k_S^+ \in \mathbb{R}^d$, and maintains two cross-batch memory banks of negatives of size $K$. With cosine similarity

$\langle x,y\rangle = (x^\top y)/(\|x\|\|y\|)$

and $s(x,y)=\exp(\langle x,y\rangle/\tau)$, the losses are

$L_{Co\text{-}IM}=L_{NCE}^{V\to S}+L_{NCE}^{S\to V}, \qquad L_{Co\text{-}ID}=L_{NCE}^{V}+L_{NCE}^{S},$

where each $L_{NCE}$ term is a noise-contrastive loss that scores a masked query against its positive key and the $K$ memory-bank negatives using $s(\cdot,\cdot)$.
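The four-term decomposition above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes the standard InfoNCE form for every $L_{NCE}$ term, represents the memory banks as plain lists of vectors, and uses hypothetical names (`info_nce`, `coco_losses`) and an illustrative default `tau=0.1` that are not taken from the source.

```python
import math


def cosine(x, y):
    """Cosine similarity <x, y> = x^T y / (||x|| ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def info_nce(query, positive_key, negative_keys, tau=0.1):
    """Assumed InfoNCE form: -log(s(q, k+) / (s(q, k+) + sum_j s(q, k_j^-)))."""
    pos = math.exp(cosine(query, positive_key) / tau)
    neg = sum(math.exp(cosine(query, k) / tau) for k in negative_keys)
    return -math.log(pos / (pos + neg))


def coco_losses(q_v, q_s, k_v, k_s, mem_v, mem_s, tau=0.1):
    """Return (L_Co-IM, L_Co-ID) for one video-sentence pair.

    q_v, q_s: masked video / sentence queries.
    k_v, k_s: unmasked video / sentence keys (the positives).
    mem_v, mem_s: negative keys drawn from the video / sentence memory.
    """
    # Inter-modal matching: a masked query must find its paired key in the other modality.
    co_im = info_nce(q_v, k_s, mem_s, tau) + info_nce(q_s, k_v, mem_v, tau)
    # Intra-modal denoising: a masked query must find its own unmasked key.
    co_id = info_nce(q_v, k_v, mem_v, tau) + info_nce(q_s, k_s, mem_s, tau)
    return co_im, co_id
```

Both terms share the same scoring function; only the choice of positive key and negative memory changes, which is what makes the intra-modal term a denoising signal directed at the masking corruption.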

The architecture uses two query encoders and two key encoders, each a standard Transformer stack with 6 layers. The key encoders share the architecture of the query encoders but are updated by momentum rather than by gradients,

$\theta_k \leftarrow m\,\theta_k + (1-m)\,\theta_q,$

where $\theta_q$ and $\theta_k$ are the query- and key-encoder parameters and $m$ is the momentum coefficient. A cross-modal decoder with 6 Transformer blocks sits on top of the query outputs to perform MLM and MSG. Two FIFO memories, one per modality and each of size $K$, store recent positive keys as negatives.
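The momentum-updated key encoder and the FIFO memory can be illustrated as follows. This is a sketch under stated assumptions: parameters are flat lists of floats, the update is the usual exponential moving average, and the default `m=0.999` and the names `momentum_update` and `FifoMemory` are illustrative choices, not values or identifiers from the paper.

```python
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """Exponential moving average: theta_k <- m * theta_k + (1 - m) * theta_q."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


class FifoMemory:
    """Fixed-size queue of recent positive keys, reused as negatives."""

    def __init__(self, size):
        # deque with maxlen silently drops the oldest key once the queue is full.
        self.keys = deque(maxlen=size)

    def enqueue(self, key):
        self.keys.append(key)

    def negatives(self):
        return list(self.keys)
```

Because no gradient flows into the key encoder, the keys held in memory stay comparable across batches even though they were produced at different training steps.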

The training pipeline masks 15% of frame positions and 15% of word tokens to form the queries while keeping unmasked originals as keys. Query encoders produce $q_V^{(m)}$ and $q_S^{(m)}$, key encoders produce $k_V^+$ and $k_S^+$, negative sets are built from memory, and the model computes $L_{NCE}^{V\to S}$, $L_{NCE}^{S\to V}$, $L_{NCE}^{V}$, and $L_{NCE}^{S}$. Gradients are back-propagated through the query encoders and decoder; the key encoders are updated only by momentum. The paper also reports the memory bank size $K$ per modality, the temperature $\tau$, the batch size, the learning rate, the maximum number of pre-training epochs, and the maximum number of frames sampled per video on TV and on ACTION, with frame features taken from ResNet-152 and SlowFast backbones.
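The query/key construction in this pipeline can be sketched as below. The 15% ratio is from the text. Everything else is an assumption for illustration: the function name `mask_sequence`, the rounding rule, the choice to mask at least one position, and the use of a string placeholder for masked frames (real frame inputs would be feature vectors).

```python
import random


def mask_sequence(items, ratio=0.15, mask_token="[MASK]", rng=None):
    """Return (masked_query, unmasked_key) for one sequence of frames or words."""
    rng = rng or random.Random()
    # Assumption: round to the nearest count and mask at least one position when possible.
    n_mask = min(len(items), max(1, round(ratio * len(items))))
    positions = set(rng.sample(range(len(items)), n_mask))
    query = [mask_token if i in positions else item for i, item in enumerate(items)]
    key = list(items)  # the key keeps the original, unmasked content
    return query, key
```

The masked copy goes to the query encoder and the untouched copy to the momentum-updated key encoder, so the only difference between a query and its intra-modal positive is the masking itself.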

The paper explains why this is an “oriented” form of contrastive denoising. The only noise injected comes from the artificial [MASK] tokens, and CoCo “orients” denoising specifically against that noise by contrasting each masked representation with its exact unmasked counterpart. At the same time, cross-modal matching is oriented toward the true pairing: each masked video must match its true unmasked sentence rather than any other sentence, and vice versa. CoCo-BERT was pre-trained on TV and ACTION and evaluated on cross-modal retrieval, video question answering, and video captioning, where the authors report that it is superior as a pre-trained structure.

3. Noise-level discrimination and OOD denoising in diffusion models

A related formulation appears in contrastive diffusion training, which begins from the claim that diffusion models implicitly define a log-likelihood ratio between noisy marginals and can therefore be interpreted as hidden noise-level classifiers. The noisy family is indexed by a log-SNR parameter, so each noise level has its own marginal density. For any two noise levels, the paper defines the log-likelihood ratio between the corresponding marginals and emphasizes that standard diffusion training only observes the denoiser on inputs whose actual noise level matches the level it is conditioned on, whereas evaluations at mismatched noise levels lie outside the training distribution and degrade denoiser quality (Wu et al., 2024).

The proposed self-supervised contrastive diffusion loss (CDL) turns the implicit classifier for noise levels into a training signal. A binary task is defined with one label if $x\sim p_0$ and the other if $x\sim p_\zeta$, sampled with equal probability, and the loss is the classification loss of the implicit classifier on this task.

By inserting the density-in-terms-of-denoiser form, the objective becomes a contrastive MSE-based loss on pairs at two noise levels. In the reported implementation, each training step samples a real image, samples a noise level, flips a fair coin to choose the class, computes approximate log densities via MSE proxies at the sampled SNR and at the induced noise level, and backpropagates the resulting loss. The paper characterizes this as “filling in” supervision on the OOD pair $(z_\zeta,\beta)$.
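The link between a log-likelihood ratio and a binary classifier can be made explicit with a short derivation. The notation here is assumed for illustration: $y=1$ marks a sample from $p_0$, $y=0$ a sample from $p_\zeta$, and $r(x)$ is the log-likelihood ratio. This is the generic binary cross-entropy form and may differ from the exact expression used in the paper.

```latex
\begin{aligned}
P(y=1\mid x) &= \frac{p_0(x)}{p_0(x)+p_\zeta(x)} = \sigma\big(r(x)\big),
\qquad r(x)=\log p_0(x)-\log p_\zeta(x),\\
\mathcal{L}_{\mathrm{BCE}} &= -\,\mathbb{E}\Big[\,y\,\log\sigma\big(r(x)\big) + (1-y)\,\log\sigma\big(-r(x)\big)\Big].
\end{aligned}
```

The first line uses the equal class priors stated in the text, and the second uses $1-\sigma(r)=\sigma(-r)$. Any model that supplies both log densities therefore supplies a classifier, which is why a denoiser-based density estimate can be trained with a classification loss.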

The empirical results target both sequential and parallel sampling. On a parallel sampler, the paper reports FID improvements for DDPM, VP, and VE models on CIFAR-10 and for VP and VE models on FFHQ. On the sequential deterministic EDM sampler, the gains are modest but consistent, for example for VP and VE on FFHQ at the reported NFE settings. In the 2D Dino synthetic example at a fixed target MMD, the number of Picard iterations, the NFE, and the wall-time all decrease.

Within the broader topic, this work generalizes contrastive denoising beyond paired clean/noisy reconstructions. The orientation is supplied by noise level itself and by the specific OOD discrepancy created when the denoiser is queried off the standard forward path.

4. Anatomy-aware semantic contrastive denoising in low-dose CT

ALDEN formulates an anatomy-aware low-dose CT denoising pipeline built on a GAN backbone with two additional components: an Anatomy-Aware Discriminator (AAD) and a Semantic-Guided Contrastive Learning (SCL) module. The generator adopts ESAU-Net and maps a low-dose CT (LDCT) slice to a denoised estimate under pixel-wise supervision against the paired normal-dose CT (NDCT).

The discriminator is conditioned on hierarchical semantic features extracted from the reference NDCT by a fixed pretrained vision model (PVM), for example DINOv2 or MedSAM. Embeddings are taken from three transformer layers of the PVM and paired with discriminator feature maps at three corresponding levels. At each level, an Attention-based Feature Fusion module cross-attends semantic priors and discriminator features to form anatomy-aware features (Wang et al., 11 Aug 2025).

The generator and the anatomy-aware discriminator are then trained in an adversarial game.

The contrastive component acts on PVM feature embeddings. For each batch, the fixed PVM extracts feature tensors from the LDCT input, the denoised output, and the reference NDCT. At randomly sampled spatial coordinates, ALDEN constructs one positive set and two negative sets: same-location denoised/NDCT pairs preserve structure, same-location denoised/LDCT pairs penalize residual noise, and different-location denoised/NDCT pairs penalize anatomical misalignment. These sets enter an InfoNCE-style objective with cosine similarities and a temperature parameter.
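The one-positive, two-negative construction can be sketched as follows. This is an illustrative reading, not ALDEN's code: it assumes features are already extracted and stored per spatial coordinate, it places both negative sets in a single InfoNCE denominator, and the name `scl_loss` and the default `tau=0.07` are hypothetical.

```python
import math


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def scl_loss(denoised, ndct, ldct, coords, tau=0.07):
    """Average contrastive loss over sampled coordinates.

    denoised, ndct, ldct: dicts mapping a coordinate to a feature vector
    from a frozen feature extractor (assumed to be precomputed).
    """
    total = 0.0
    for p in coords:
        anchor = denoised[p]
        # Positive: same location in the normal-dose reference.
        pos = math.exp(cosine(anchor, ndct[p]) / tau)
        # Negative 1: same location in the low-dose input (residual noise).
        neg_noise = math.exp(cosine(anchor, ldct[p]) / tau)
        # Negative 2: other locations in the reference (anatomical mismatch).
        neg_anatomy = sum(math.exp(cosine(anchor, ndct[q]) / tau) for q in coords if q != p)
        total += -math.log(pos / (pos + neg_noise + neg_anatomy))
    return total / len(coords)
```

A denoised feature that still resembles the low-dose input raises the first negative term, while one that resembles the wrong anatomical location raises the second, so the two failure modes are penalized separately.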

The total objective is a weighted combination of the pixel-wise, adversarial, and contrastive terms. Training uses Adam; the paper specifies the loss weights, the Adam moment coefficients, the learning rate, the batch size, and the number of iterations.

The quantitative results are reported on four image-quality metrics: PSNR (in dB), SSIM, RMSE, and LPIPS. On Mayo2016, results for ALDEN-DINOv2 are given on all four. On the in-house MCTD dataset, ALDEN-DINOv2 leads in SSIM, RMSE, and LPIPS. The Mayo2016 ablation study compares four configurations on the same metrics: the baseline ESAU-Net+GAN, the baseline with AAD only, the baseline with SCL only, and ALDEN with both components. On the downstream multi-organ segmentation task with TotalSegmentator’s test set, the Dice score is reported for a low-noise and a high-noise scenario, the latter reported as the best among the compared methods.

In this formulation, orientation is semantic and spatial. Positive pairing is restricted to matched anatomy at the same coordinates, while the dual negatives explicitly target residual LDCT noise and cross-location anatomical mismatch. The paper presents this as a way to preserve tissue-specific patterns without requiring manual segmentation labels.

5. Trajectory-oriented latent denoising in robust representation consistency models

rRCM reformulates denoising along diffusion trajectories as a discriminative latent-space problem connected to randomized smoothing. The method starts from a forward diffusion SDE and its associated probability-flow ODE (PF-ODE). After discretization into time-steps $t_n$, the noisy points on one trajectory share the same Gaussian noise $\epsilon$, and rRCM uses instance discrimination to align temporally adjacent points along the same trajectory (Lei et al., 22 Jan 2025).
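For reference, the general relationship between a forward diffusion SDE and its probability-flow ODE from the score-based diffusion literature is shown below. The symbols $f$, $g$, and $w_t$ are the generic drift, diffusion coefficient, and Wiener process; the specific choices made in rRCM are not restated here, so this should be read as background rather than as the paper's exact equations.

```latex
\mathrm{d}x_t = f(x_t,t)\,\mathrm{d}t + g(t)\,\mathrm{d}w_t,
\qquad
\frac{\mathrm{d}x_t}{\mathrm{d}t} = f(x_t,t) - \tfrac{1}{2}\,g(t)^2\,\nabla_{x}\log p_t(x_t).
```

The ODE is deterministic and shares the marginals $p_t$ of the SDE, which is what allows each clean image to be associated with a single trajectory along which adjacent points can be paired.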

The encoder is a Vision Transformer with time embedding, followed by a linear head for logits and, during pre-training only, a 3-layer MLP projector. The oriented consistency term is an instance-discrimination contrastive loss computed on normalized projector embeddings.

The positive pair is $(x_{t_n}^i,x_{t_{n-1}}^i)$, where both points use the same Gaussian noise $\epsilon$ and differ only by one adjacent time-step; negatives come from other samples in the batch. In parallel, the model applies a standard augmentation contrastive loss using two augmented views of the clean image. The joint pre-training objective minimizes the sum of the consistency and augmentation contrastive terms.
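The pairing rule can be made concrete with a small sketch. Several parts are assumptions rather than details from the paper: the perturbation form `x_t = x0 + t * eps`, the in-batch InfoNCE form of the loss, the default `tau=0.5`, and the names `noisy_pair` and `consistency_loss`. Embeddings are passed in directly, so no encoder is modeled.

```python
import math


def noisy_pair(x0, eps, t_n, t_prev):
    """Two points on one trajectory: shared noise eps, adjacent times.

    Assumed perturbation form for illustration: x_t = x0 + t * eps.
    """
    x_t = [a + t_n * e for a, e in zip(x0, eps)]
    x_prev = [a + t_prev * e for a, e in zip(x0, eps)]
    return x_t, x_prev


def _normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def consistency_loss(z_t, z_prev, tau=0.5):
    """In-batch InfoNCE: z_t[i] should match z_prev[i] and no other z_prev[j]."""
    z_t = [_normalize(v) for v in z_t]
    z_prev = [_normalize(v) for v in z_prev]
    total = 0.0
    for i, anchor in enumerate(z_t):
        logits = [sum(a * b for a, b in zip(anchor, other)) / tau for other in z_prev]
        log_denominator = math.log(sum(math.exp(l) for l in logits))
        total += log_denominator - logits[i]  # equals -log softmax(logits)[i]
    return total / len(z_t)
```

Reusing the same `eps` for both time-steps is what keeps the two points on one trajectory; drawing fresh noise for the second point would pair points from different trajectories, which is the mis-specified case the text describes.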

The paper defines orientation very narrowly: because the PF-ODE yields a unique continuous trajectory for each clean image, rRCM draws positives only among adjacent time-steps on the same trajectory. The ablations state that pairing points with different $\epsilon$ or pairing non-adjacent time-steps breaks that orientation and yields inferior alignment. Three model sizes are reported for ImageNet: rRCM-S, rRCM-B, and rRCM-B-Deep. The same temperature is used for both the consistency and augmentation losses. Pre-training is run separately on ImageNet and on CIFAR-10 with AdamW, and fine-tuning then runs for a fixed number of epochs on each dataset.

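For randomized smoothing, once the classifier $f$ is fixed, the smoothed classifier is

$g(x)=\arg\max_{c}\ \mathbb{P}_{\delta\sim\mathcal{N}(0,\sigma^2 I)}\big(f(x+\delta)=c\big)$

and the certified $\ell_2$ radius is

$R=\frac{\sigma}{2}\left(\Phi^{-1}(\underline{p_A})-\Phi^{-1}(\overline{p_B})\right),$

where $\Phi^{-1}$ is the inverse standard normal CDF, $\underline{p_A}$ is a lower bound on the probability of the top class under noise, and $\overline{p_B}$ is an upper bound on the probability of the runner-up class.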
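A minimal standard-library sketch of the smoothed prediction and the certified radius follows. It uses the standard randomized-smoothing radius formula with hypothetical names: `p_a_lower` and `p_b_upper` are assumed to be precomputed confidence bounds on the top and runner-up class probabilities, and `base_classifier` is any function from an input vector to a label. Estimating those bounds from samples is outside the scope of the sketch.

```python
import random
from collections import Counter
from statistics import NormalDist


def smoothed_predict(base_classifier, x, sigma, n_samples=100, rng=None):
    """Majority vote of the base classifier under Gaussian input noise."""
    rng = rng or random.Random(0)
    votes = Counter()
    for _ in range(n_samples):
        noisy = [v + rng.gauss(0.0, sigma) for v in x]
        votes[base_classifier(noisy)] += 1
    return votes.most_common(1)[0][0]


def certified_radius(p_a_lower, p_b_upper, sigma):
    """L2 radius (sigma / 2) * (Phi^{-1}(p_a_lower) - Phi^{-1}(p_b_upper)).

    Both bounds must lie strictly between 0 and 1; returns 0.0 when the
    bounds do not separate the top class from the runner-up.
    """
    if p_a_lower <= p_b_upper:
        return 0.0
    inv_cdf = NormalDist().inv_cdf
    return 0.5 * sigma * (inv_cdf(p_a_lower) - inv_cdf(p_b_upper))
```

The radius grows with both the noise level and the margin between the two bounds, which is why a classifier that stays accurate on heavily noised inputs yields larger certificates.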

The paper reports that the method outperforms the certified accuracy of diffusion-based methods on ImageNet across all perturbation radii, with the largest gains at larger radii, while reducing inference costs on average. Certified accuracies and inference latency are reported for rRCM-B and rRCM-B-Deep on ImageNet and for rRCM-B on CIFAR-10 over a range of radii.

This work places oriented contrastive denoising in a robustness setting rather than a generative one. The denoising effect is implicit: instead of explicitly reconstructing a clean sample, the model learns representation consistency as one moves backward along the diffusion path, enabling one-shot denoising-and-classification.

6. Conceptual scope, common misconceptions, and open directions

The literature does not present a single universal objective for oriented contrastive denoising. Instead, each formulation instantiates orientation differently: masking noise and true cross-modal pairing in CoCo-BERT, noise-level discrimination and OOD correction in contrastive diffusion training, anatomy-aware spatial semantics in ALDEN, and adjacent same-trajectory consistency in rRCM (Luo et al., 2021; Wu et al., 2024; Wang et al., 11 Aug 2025; Lei et al., 22 Jan 2025).

A common misconception is that contrastive denoising is equivalent to generic augmentation-based self-supervision. The cited formulations do not support that interpretation. Their positives are exact unmasked counterparts, clean-versus-extra-noisy samples, same-location anatomical matches, or adjacent points on the same diffusion trajectory. Their negatives are likewise structured: FIFO memory-bank negatives from other videos or sentences, mismatched noise levels, same-location LDCT residuals, cross-location anatomy mismatches, or other samples in the batch. This suggests that the critical ingredient is not contrastive learning in the abstract, but the specification of a domain-valid orientation relation.

A second misconception is that denoising must be expressed only in pixel space. CoCo-BERT performs denoising at the sequence-representation level while coupling it to cross-modal matching. ALDEN applies contrastive loss to pretrained semantic feature embeddings and couples it to an anatomy-aware discriminator. rRCM operates on latent representations of a time-conditioned ViT. Contrastive diffusion training derives a classifier-like objective from log-likelihood ratios between noisy marginals. The underlying denoising mechanism is therefore representation-level in several of the reported systems.

The open issues named in the literature are also domain-specific. ALDEN identifies dependence on PVM domain gap, compute overhead from cross-attention in the discriminator and contrastive sampling, the possibility that random negative sampling may miss rare tissue patterns, the extension from 2D slices to full 3D volumes, and joint finetuning of the PVM as future directions (Wang et al., 11 Aug 2025). Contrastive diffusion training isolates denoiser degradation in regions far outside the training distribution as a core sampling problem, especially for parallel sampling (Wu et al., 2024). rRCM shows that mis-specified orientation—non-adjacent time-steps or different noise realizations—reduces performance (Lei et al., 22 Jan 2025). CoCo-BERT begins from the argument that masked inputs “would inevitably introduce noise for cross-modal matching proxy task,” motivating a denoising objective specifically oriented to that masking process (Luo et al., 2021).

Across these works, oriented contrastive denoising emerges as a framework for turning known structure in corruption, pairing, or dynamics into a discriminative training signal. The orientation signal determines what counts as faithful recovery, whether that recovery is defined as true video–sentence alignment, low-OOD denoiser behavior, anatomical consistency, or invariant representation along a diffusion trajectory.