Contrastive-SDE in Unpaired Image Translation
- Contrastive-SDE is a method that employs time-dependent contrastive learning to guide stochastic differential equations for unpaired image translation.
- It integrates a two-stage process combining unsupervised contrastive representation learning with score-based diffusion guidance to maintain semantic fidelity.
- Empirical results show competitive realism and superior faithfulness metrics with lower computational costs compared to classifier-guided approaches.
Contrastive-SDE refers to a methodology that leverages contrastive learning to guide stochastic differential equations (SDEs) for unpaired image-to-image (I2I) translation tasks. This approach introduces a time-dependent contrastive signal—trained to encode domain-invariant features—that is injected into the sampling trajectory of a pretrained score-based diffusion model to enforce semantic consistency between translated outputs and source images, without explicit supervision or paired data (Kotyada et al., 4 Oct 2025).
1. Conceptual Framework and Motivation
Contrastive-SDE is motivated by two converging trends in generative modeling for unpaired I2I: the success of score-based diffusion models for representing complex distributions via SDEs, and the capability of contrastive learning to extract semantic structure in the absence of supervision. Traditional unpaired I2I frameworks, such as adversarial methods (CycleGAN, MUNIT), enforce cross-domain constraints via adversarial or cycle-consistency losses, while more recent diffusion-based strategies employ classifier or energy-based guidance, which typically require training explicit domain classifiers and can be computationally costly.
Contrastive-SDE departs from classifier or energy guidance: it instead uses an unsupervised, time-conditioned contrastive learning module to learn representations that are invariant to domain changes. These learned domain-invariant features are then used to guide the reverse SDE sampling dynamics during image translation, promoting output images that maintain semantic correspondence with the source while discarding domain-specific artifacts.
2. Core Methodology
The approach centers on a two-stage process:
- Contrastive Representation Learning:
- A U-Net-based encoder is trained with a time-conditional contrastive loss (the SimCLR NT-Xent loss) to map each image and its low-pass filtered version to similar feature representations.
- The low-pass filter serves to accentuate domain-invariant image content while suppressing domain-specific (often high-frequency) details.
- Score-Based Diffusion Guidance:
- The forward generative process is defined by a stochastic differential equation:

$$\mathrm{d}\mathbf{x} = f(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}$$

- The reverse (sampling) process uses an estimated score function $s_\theta(\mathbf{x}, t) \approx \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$:

$$\mathrm{d}\mathbf{x} = \left[ f(\mathbf{x}, t) - g(t)^2\, s_\theta(\mathbf{x}, t) \right] \mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}$$

- During translation, this reverse SDE is modified by a contrastive guidance term:

$$\mathrm{d}\mathbf{x} = \left[ f(\mathbf{x}, t) - g(t)^2 \left( s_\theta(\mathbf{x}, t) + \lambda\, \nabla_{\mathbf{x}} S(\mathbf{x}, \mathbf{x}_0, t) \right) \right] \mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}$$

Here $S(\mathbf{x}, \mathbf{x}_0, t) = \mathrm{sim}\big(h_\phi(\mathbf{x}, t),\, h_\phi(\mathbf{x}_0, t)\big)$, with $\mathrm{sim}$ computing a channel-wise cosine similarity between the contrastive features of the current sample $\mathbf{x}$ and the original source image $\mathbf{x}_0$, both extracted by the contrastive encoder $h_\phi$ at the current time $t$. This gradient-based signal ensures that, at each reverse diffusion step, the generated image remains semantically aligned with the domain-invariant content of $\mathbf{x}_0$.
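To make the guided sampling step concrete, here is a minimal NumPy sketch of one Euler–Maruyama update of the reverse SDE with a contrastive guidance term. It is an illustration, not the authors' implementation: `score_fn` and `feat_fn` are hypothetical stand-ins for the pretrained score network and the time-conditioned contrastive encoder, and a real implementation would backpropagate the similarity through the encoder (e.g., with autograd) rather than assume identity-like features.

```python
import numpy as np

def cosine_sim_grad(x, y, eps=1e-8):
    """Gradient of cos(x, y) with respect to x, for flat feature vectors."""
    nx, ny = np.linalg.norm(x) + eps, np.linalg.norm(y) + eps
    cos = float(x @ y) / (nx * ny)
    return y / (nx * ny) - cos * x / (nx * nx)

def guided_reverse_step(x, t, dt, sigma, score_fn, feat_fn, x_src, lam, rng):
    """One Euler-Maruyama step of the guided reverse VE-SDE, from t toward t - dt.

    The drift combines the learned score with the gradient of the contrastive
    similarity between the current sample's features and the source image's
    features, weighted by lam.
    """
    g2 = sigma(t) ** 2
    # Sketch only: applying cosine_sim_grad in feature space equals the true
    # gradient w.r.t. x only when feat_fn is (near-)identity; in practice this
    # gradient is obtained by backpropagating through the contrastive encoder.
    guidance = lam * cosine_sim_grad(feat_fn(x, t), feat_fn(x_src, t))
    drift = g2 * (score_fn(x, t) + guidance)
    noise = sigma(t) * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x + drift * dt + noise
```

A toy invocation would pass, e.g., `score_fn=lambda x, t: -x` (the score of a standard Gaussian) and `feat_fn=lambda x, t: x`; repeated steps then pull the sample toward the data mode while the guidance term nudges it toward feature-space agreement with the source.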
3. Practical Implementation and Architectural Details
Contrastive Module Training:
- The contrastive model is trained in advance on simulated positive pairs $(x, \tilde{x})$, where $\tilde{x}$ is the low-pass filtered version of $x$, using the NT-Xent loss:

$$\ell_{i,j} = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)}$$

where $\mathrm{sim}(\cdot, \cdot)$ is cosine similarity, $z_i$ and $z_j$ are the embeddings of a positive pair within a batch of $2N$ samples, and $\tau$ is the temperature parameter.
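A minimal NumPy sketch of this pretraining setup, assuming a simple box blur as the low-pass filter and operating directly on embedding vectors (the paper's actual U-Net encoder and filter choice are not reproduced here):

```python
import numpy as np

def low_pass(img, k=5):
    """Separable box-blur low-pass filter: suppresses high-frequency,
    domain-specific detail while keeping coarse, domain-invariant content."""
    kernel = np.ones(k) / k
    out = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, out)

def nt_xent(z, tau=0.5):
    """NT-Xent loss over a batch of 2N embeddings, where rows i and i+N form
    the positive pair (image and its low-pass filtered version)."""
    n2 = z.shape[0]
    n = n2 // 2
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize
    sim = z @ z.T / tau                                 # cosine similarities
    np.fill_diagonal(sim, -np.inf)                      # exclude self-pairs
    pos = np.concatenate([np.arange(n, n2), np.arange(0, n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(n2), pos].mean()
```

Stacking the encoder outputs for a batch of images and their blurred counterparts as `z = np.vstack([h(x), h(low_pass(x))])` recovers the training objective: the loss is minimized when each image and its low-pass version map to nearby embeddings while all other pairs repel.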
Diffusion Model and Guidance Integration:
- The score-based diffusion model is pretrained on the target domain.
- During test-time I2I translation, the trained contrastive network supplies the gradient of the contrastive similarity with respect to the generated image, which is injected into the reverse SDE. This requires backpropagating through the contrastive network at each SDE time step.
- Guidance Scheduling and Hyperparameters:
- The contrastive guidance weight and the time step at which guidance begins (together controlling the strength and duration of the contrastive signal) are tunable hyperparameters. Adjusting them trades off output realism (as measured by FID) against strict semantic faithfulness to the source.
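The effect of these two knobs can be sketched as a time-dependent weight on the guidance term; `lam` and `t_start` are hypothetical names, and the paper's exact schedule is not reproduced here:

```python
def guidance_weight(t, lam, t_start):
    """Contrastive guidance weight at reverse-diffusion time t in [0, 1].

    Guidance is active only for t <= t_start: a larger lam or t_start pushes
    samples toward faithfulness to the source, while smaller values let the
    unguided score dominate, favoring realism (lower FID).
    """
    return lam if t <= t_start else 0.0
```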
4. Empirical Evaluation and Results
Contrastive-SDE is quantitatively evaluated on public unpaired I2I translation benchmarks, including Cat→Dog, Wild→Dog, and Male→Female (drawn from AFHQ and CelebA). Four metrics are used: FID (for realism), and L2 distance, PSNR, and SSIM (for faithfulness).
Key empirical observations:
| Model | FID (↓) | L2 (↓) | PSNR (↑) | SSIM (↑) |
|---|---|---|---|---|
| Contrastive-SDE | competitive | best | best | best |
| EGSDE | lower (better) | — | — | — |
| SDDM | — | — | — | — |
- Contrastive-SDE achieves state-of-the-art or near-state-of-the-art performance on the faithfulness metrics and achieves competitive FID.
- The method converges faster (2K iterations in 2 hours) than classifier-guided methods (e.g., EGSDE: 5K iterations in 7 hours).
- No supervised labels or external classifiers are required, streamlining the training process.
5. Comparative Analysis and Trade-offs
Contrastive-SDE contrasts with previous unpaired I2I methods in several ways:
| Method Type | Semantic Consistency | Realism (FID) | Computational Cost | Explicit Supervision |
|---|---|---|---|---|
| Adversarial (CycleGAN) | Partial (cycle loss) | Good | High | No |
| Classifier/EGSDE Guidance | Strong (classifier) | Best | High | Yes |
| Contrastive-SDE | Strong (contrastive) | Comparable | Low | No |
Limitations:
- The use of low-pass filtering for domain-invariant feature selection is not optimal; some domain-specific information may persist, modestly limiting realism.
- Improvements could be obtained by developing more sophisticated domain-invariant feature extractors or explicitly penalizing domain-specific details in future iterations.
6. Applications and Prospective Extensions
Contrastive-SDE is applicable to unpaired image translation tasks where no aligned or paired data exists, and domain supervision is limited or unavailable. Specific use cases include:
- Animal face translation (e.g., Cat→Dog)
- Attribute transfer (e.g., Male→Female)
- Scene translation (suggested: Summer→Winter, Horse→Zebra)
Future research directions highlighted by the authors include:
- Employing more advanced—or non-parametric—domain-invariant feature extractors.
- Explicitly incorporating penalty terms for domain-specific features to further boost realism.
- Broadening the approach to additional challenging unpaired translation benchmarks.
7. Significance and Positioning within Diffusion Guidance Paradigms
Contrastive-SDE exemplifies a paradigm where perceptual or semantic guidance is supplied during generative diffusion, entirely via self-supervised contrastive learning. This stands in contrast to classifier-based and explicit energy guidance, reducing complexity and computational resources. The approach leverages the inherent strengths of SDE-based generative models—flexible sampling and strong mode coverage—while using time-dependent feature-space similarity as a tractable, unsupervised guidance signal.
A plausible implication is that as semantic guidance in generative modeling shifts toward flexible, contrastive objectives, approaches like Contrastive-SDE may become increasingly prominent for data domains where hard alignment or explicit supervision is expensive or impossible. The architectural and empirical results suggest competitive performance can be achieved with substantial training acceleration by leveraging domain-invariant properties enforced via contrastive learning.