
Contrastive-SDE in Unpaired Image Translation

Updated 11 October 2025
  • Contrastive-SDE is a method that employs time-dependent contrastive learning to guide stochastic differential equations for unpaired image translation.
  • It integrates a two-stage process combining unsupervised contrastive representation learning with score-based diffusion guidance to maintain semantic fidelity.
  • Empirical results show competitive realism and superior faithfulness metrics with lower computational costs compared to classifier-guided approaches.

Contrastive-SDE refers to a methodology that leverages contrastive learning to guide stochastic differential equations (SDEs) for unpaired image-to-image (I2I) translation tasks. The approach introduces a time-dependent contrastive signal, trained to encode domain-invariant features, that is injected into the sampling trajectory of a pretrained score-based diffusion model to enforce semantic consistency between translated outputs and source images, without explicit supervision or paired data (Kotyada et al., 4 Oct 2025).

1. Conceptual Framework and Motivation

Contrastive-SDE is motivated by two converging trends in generative modeling for unpaired I2I: the success of score-based diffusion models for representing complex distributions via SDEs, and the capability of contrastive learning to extract semantic structure in the absence of supervision. Traditional unpaired I2I frameworks, such as adversarial methods (CycleGAN, MUNIT), enforce cross-domain constraints via adversarial or cycle-consistency losses, while more recent diffusion-based strategies employ classifier or energy-based guidance, which typically require training explicit domain classifiers and can be computationally costly.

Contrastive-SDE departs from classifier or energy guidance: it instead uses an unsupervised, time-conditioned contrastive learning module to learn representations that are invariant to domain changes. These learned domain-invariant features are then used to guide the reverse SDE sampling dynamics during image translation, promoting output images that maintain semantic correspondence with the source while discarding domain-specific artifacts.

2. Core Methodology

The approach centers on a two-stage process:

  1. Contrastive Representation Learning:
    • A U-Net-based architecture, denoted $F$, is trained using a time-conditional contrastive loss (the SimCLR NT-Xent loss) to map each image $x$ and its low-pass filtered version $\tilde{x}$ to similar feature representations.
    • The low-pass filter serves to accentuate domain-invariant image content while suppressing domain-specific (often high-frequency) details.
  2. Score-Based Diffusion Guidance:
    • The forward generative process is defined by a stochastic differential equation:

    $dy = f(y, t)\,dt + g(t)\,dw(t)$

  • The reverse (sampling) process uses an estimated score function $s_\theta(y, t)$:

    $dy = \left[ f(y, t) - g(t)^2 s_\theta(y, t) \right] dt + g(t)\,d\hat{w}(t)$

  • During translation, this reverse SDE is modified by a contrastive guidance term:

    $dy = \left[ f(y, t) - g(t)^2 \left( s_\theta(y, t) - \nabla_y Q(y, x_0, t) \right) \right] dt + g(t)\,d\hat{w}(t)$

    Here, $Q(y, x_0, t) = -\lambda S(y, x_0, t)$, where $S(y, x_0, t)$ computes a channel-wise cosine similarity between the contrastive features of $y$ and the original source image $x_0$, both extracted at the current time $t$:

    $S(y, x_0, t) = \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \frac{h_t^{(mn)\top} h_0^{(mn)}}{\| h_t^{(mn)} \|_2 \, \| h_0^{(mn)} \|_2}$

    This gradient-based signal ensures that, at each reverse diffusion step, the generated image yy remains semantically aligned with the domain-invariant content of x0x_0.
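The two-stage guidance above can be made concrete with a toy numerical sketch. This is not the paper's implementation: the contrastive features are taken to be the image vector itself, so the gradient of the cosine similarity has a closed form instead of requiring backpropagation through $F$, and `score_fn`, `g`, and `lam` are illustrative stand-ins.

```python
import numpy as np

def cosine_sim(a, b):
    # S(y, x0, t) reduced to a single cosine similarity (toy features)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def grad_cosine_sim(y, h0):
    # Closed-form gradient of cos(y, h0) w.r.t. y; a real implementation
    # would backpropagate through the contrastive network F instead.
    ny, n0 = np.linalg.norm(y), np.linalg.norm(h0)
    c = (y @ h0) / (ny * n0)
    return h0 / (ny * n0) - c * y / ny ** 2

def guided_reverse_step(y, h0, score_fn, g, lam, dt, rng):
    # One Euler-Maruyama step of the guided reverse SDE with f = 0.
    # Integrating backward in time flips the drift sign, so the update
    # moves along +g^2 (s_theta - grad_y Q), with Q = -lam * S.
    grad_Q = -lam * grad_cosine_sim(y, h0)
    drift = (g ** 2) * (score_fn(y) - grad_Q)  # = g^2 (s_theta + lam * grad S)
    return y + drift * dt + g * np.sqrt(dt) * rng.standard_normal(y.shape)
```

With the score term zeroed out and the noise suppressed, a single step increases the similarity to the source features, which is exactly the role the guidance term plays at each reverse diffusion step.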

3. Practical Implementation and Architectural Details

  • Contrastive Module Training:

    • The contrastive model $F$ is trained in advance, using simulated pairs $(x, \tilde{x})$, with the NT-Xent loss:

    $\mathcal{L}_{\text{NT-Xent}} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$

    where $\mathrm{sim}(\cdot, \cdot)$ is cosine similarity and $\tau$ is the temperature parameter.
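For reference, the NT-Xent loss can be sketched in a few lines of NumPy. This is a simplified sketch, assuming the batch stacks the two views so that rows $i$ and $i+N$ are positives; real training would use differentiable tensors and the time-conditioning of $F$:

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    # z: (2N, d) array of projections, where rows i and i+N are the two
    # views of the same image (e.g. x and its low-pass version x_tilde).
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine sim = dot product
    n2 = z.shape[0]
    n = n2 // 2
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)  # exclude k = i from the denominator
    pos = np.concatenate([np.arange(n, n2), np.arange(0, n)])  # positive pairs
    log_den = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(log_den - sim[np.arange(n2), pos]))
```

When the two views of each image project to identical features, the positive similarity is maximal and the loss is small; for unrelated projections it is large, which is what drives $F$ toward domain-invariant representations.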

  • Diffusion Model and Guidance Integration:

    • The score-based diffusion model is pretrained on the target domain.
    • During test-time I2I translation, the trained contrastive network supplies $\nabla_y Q$, which is injected into the reverse SDE. This requires backpropagation of the contrastive similarity score with respect to the generated image $y$ at each SDE time step.
  • Guidance Scheduling and Hyperparameters:
    • The contrastive guidance strength $\lambda$ and the starting time step (controlling the strength and duration of the contrastive signal) are tunable. Adjusting these hyperparameters trades off output image realism (as measured by FID) against strict semantic faithfulness to the source.
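One way to picture how $\lambda$ and the starting time step interact is a gated sampling loop. Everything below (function names, the $f = 0$ simplification, the gating rule) is an illustrative assumption rather than the paper's code; `grad_S` stands in for the autograd-computed gradient of the similarity:

```python
import numpy as np

def sample_with_guidance(y_init, h0, score_fn, grad_S, g, lam, t_start,
                         n_steps=100, seed=0):
    # Illustrative reverse-sampling loop (Euler-Maruyama, f = 0).
    # Contrastive guidance of strength lam is applied only for t <= t_start,
    # so lam and t_start jointly trade realism (FID) against faithfulness.
    rng = np.random.default_rng(seed)
    y, dt = y_init.copy(), 1.0 / n_steps
    for k in range(n_steps):
        t = 1.0 - k * dt                      # time runs from 1 down to 0
        lam_t = lam if t <= t_start else 0.0  # gate the contrastive signal
        # stepping backward in time flips the reverse-SDE drift sign:
        drift = (g(t) ** 2) * (score_fn(y, t) + lam_t * grad_S(y, h0, t))
        y = y + drift * dt + g(t) * np.sqrt(dt) * rng.standard_normal(y.shape)
    return y
```

Setting `t_start = 0` disables guidance entirely (pure diffusion sampling), while larger values apply the contrastive pull over more of the trajectory.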

4. Empirical Evaluation and Results

Contrastive-SDE is quantitatively evaluated on public unpaired I2I translation benchmarks, including Cat→Dog, Wild→Dog, and Male→Female (AFHQ, CelebA). Four metrics are used: FID (for realism), and L2 distance, PSNR, and SSIM (for faithfulness).

Key empirical observations:

| Model | FID (↓) | L2 (↓) | PSNR (↑) | SSIM (↑) |
|---|---|---|---|---|
| Contrastive-SDE | moderate | best | best | best |
| EGSDE | lower | — | — | — |
| SDDM | — | — | — | — |
  • Contrastive-SDE achieves state-of-the-art or near-state-of-the-art performance on the faithfulness metrics while remaining competitive on FID.
  • The method converges faster than classifier-guided methods, requiring 2K iterations (about 2 hours) versus, e.g., EGSDE's 5K iterations (about 7 hours).
  • No supervised labels or external classifiers are required, streamlining the training process.

5. Comparative Analysis and Trade-offs

Contrastive-SDE contrasts with previous unpaired I2I methods in several ways:

| Method Type | Semantic Consistency | Realism (FID) | Computational Cost | Explicit Supervision |
|---|---|---|---|---|
| Adversarial (CycleGAN) | Partial (cycle loss) | Good | High | No |
| Classifier/EGSDE guidance | Strong (classifier) | Best | High | Yes |
| Contrastive-SDE | Strong (contrastive) | Comparable | Low | No |

Limitations:

  • The use of low-pass filtering for domain-invariant feature selection is not optimal; some domain-specific information may persist, modestly limiting realism.
  • Improvements could be obtained by developing more sophisticated domain-invariant feature extractors or explicitly penalizing domain-specific details in future iterations.

6. Applications and Prospective Extensions

Contrastive-SDE is applicable to unpaired image translation tasks where no aligned or paired data exists, and domain supervision is limited or unavailable. Specific use cases include:

  • Animal face translation (e.g., Cat→Dog)
  • Attribute transfer (e.g., Male→Female)
  • Scene translation (suggested: Summer→Winter, Horse→Zebra)

Future research directions highlighted by the authors include:

  • Employing more advanced—or non-parametric—domain-invariant feature extractors.
  • Explicitly incorporating penalty terms for domain-specific features to further boost realism.
  • Broadening the approach to additional challenging unpaired translation benchmarks.

7. Significance and Positioning within Diffusion Guidance Paradigms

Contrastive-SDE exemplifies a paradigm where perceptual or semantic guidance is supplied during generative diffusion, entirely via self-supervised contrastive learning. This stands in contrast to classifier-based and explicit energy guidance, reducing complexity and computational resources. The approach leverages the inherent strengths of SDE-based generative models—flexible sampling and strong mode coverage—while using time-dependent feature-space similarity as a tractable, unsupervised guidance signal.

A plausible implication is that as semantic guidance in generative modeling shifts toward flexible, contrastive objectives, approaches like Contrastive-SDE may become increasingly prominent for data domains where hard alignment or explicit supervision is expensive or impossible. The architectural and empirical results suggest competitive performance can be achieved with substantial training acceleration by leveraging domain-invariant properties enforced via contrastive learning.
