Contrastive CFG: Robust Diffusion Guidance

Updated 3 April 2026

Contrastive CFG (CCFG) is a guided sampling method for conditional diffusion models that replaces linear guidance with a contrastive objective to balance positive alignment and concept removal.
It dynamically adjusts guidance using noise-contrastive distances, mitigating artifacts from naïve negative CFG while preserving sample realism.
CCFG is applied at sampling time with minimal overhead. It was validated on MNIST, CIFAR-10, and text-to-image generation with Stable Diffusion 1.5, where it compares favorably with naïve negated CFG and Dynamic Negative Guidance on FID and alignment metrics.

Contrastive Classifier-Free Guidance (CCFG) is a guided sampling method designed for conditional diffusion models to address the limitations of standard Classifier-Free Guidance (CFG) in both positive (alignment) and negative (negation or editing) prompt scenarios. CCFG replaces the linear guidance vector of CFG with a contrastive objective, yielding robust sample quality for both concept attraction and selective concept removal, while avoiding the sample pathologies that arise from naïve negative CFG. The approach is implemented entirely at sampling time, introduces minimal computational overhead, and requires no retraining or external classifier.

1. Theoretical Motivation and Problem Statement

Standard CFG augments conditional diffusion sampling by extrapolating from the unconditional noise prediction toward the conditional one. For a diffusion model with conditional noise estimator $\epsilon_\theta(x_t; y)$ and unconditional estimator $\epsilon_\theta(x_t; \varnothing)$, the DDIM update uses

$\epsilon_t^+ = \epsilon_\theta(x_t; \varnothing) + \gamma [\epsilon_\theta(x_t; y) - \epsilon_\theta(x_t; \varnothing)]$

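where $\gamma > 1$ is the guidance scale. Because the noise prediction is proportional to the negative score, this update targets a density proportional to $p(x)\,p(y \mid x)^{\gamma}$, which sharpens samples around the condition and significantly improves alignment (Chang et al., 2024).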
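A minimal NumPy sketch of the combination above. The function name `cfg_noise` and the toy arrays are illustrative assumptions; in a real sampler the two inputs would come from two forward passes of the noise estimator.

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, gamma):
    """Classifier-free guidance: start at the unconditional prediction and
    move toward the conditional one, scaled by gamma."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)

# Toy stand-ins for the two noise predictions at a single time step.
eps_uncond = np.array([0.0, 1.0, -1.0])
eps_cond = np.array([0.5, 1.0, -2.0])

guided = cfg_noise(eps_uncond, eps_cond, gamma=3.0)
```

With `gamma=1.0` the function returns the conditional prediction unchanged, and with `gamma=0.0` it returns the unconditional one; values above 1 extrapolate past the conditional prediction.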

However, many practical applications require negative guidance to suppress undesired features, implemented as

$\epsilon_t^- = \epsilon_\theta(x_t; \varnothing) - \gamma [\epsilon_\theta(x_t; y^-) - \epsilon_\theta(x_t; \varnothing)],$

where $y^-$ is a negative prompt. This "negated CFG" inverts the guidance term, so sampling targets a density proportional to $p(x)/p(y^- \mid x)^{\gamma}$. That target can push samples off the data manifold, leading to severe artifacts and a loss of generative precision. Dynamic Negative Guidance (DNG) partially addresses these issues by scaling the negative vector according to the model's confidence, but the negative concept can still leak into samples when it overlaps with the positive concept or is subtle.
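The off-manifold behavior can be seen in a one-dimensional toy case. The sketch below is an illustration under stated assumptions, not an experiment from the paper: the unconditional density is a unit Gaussian, the negative concept is a narrower Gaussian, and the guidance is written in score form (the noise prediction is a negative multiple of the score, so the sign pattern of negated CFG carries over). All function names are hypothetical.

```python
import numpy as np

def gaussian_score(x, mean, std):
    """Score (gradient of the log-density) of a 1D Gaussian."""
    return -(x - mean) / std**2

def negated_cfg_score(x, gamma, neg_mean, neg_std, prior_std=1.0):
    """Negated CFG in score form: s_uncond - gamma * (s_neg - s_uncond)."""
    s_uncond = gaussian_score(x, 0.0, prior_std)
    s_neg = gaussian_score(x, neg_mean, neg_std)
    return s_uncond - gamma * (s_neg - s_uncond)

# Far to the right of the data, the guided score still points right;
# far to the left, it still points left. Nothing pulls the sample back.
right = negated_cfg_score(10.0, gamma=2.0, neg_mean=1.0, neg_std=0.5)
left = negated_cfg_score(-10.0, gamma=2.0, neg_mean=1.0, neg_std=0.5)
```

In this toy setting the guided score points away from the data on both sides, so a sample that starts drifting keeps drifting. With `gamma=0.0` the ordinary restoring score of the unit Gaussian is recovered.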

CCFG offers an alternative by embedding attraction and repulsion within a smooth noise-contrastive estimation (NCE) loss constructed from contrastive distances between conditional and unconditional noise predictions. The resulting dynamics interpolate between CFG and zero guidance depending on sample proximity to the prompt, preventing aggressive repulsion and associated sampling failures (Chang et al., 2024).

2. Mathematical Formulation and Algorithm

CCFG defines the positive and negative guidance vectors as gradients of a two-class NCE objective over denoised samples in DDIM-style diffusion:

Let $\mu^+ = \mu_\theta(x_t; y)$ and $\mu^\varnothing = \mu_\theta(x_t; \varnothing)$ denote the DDIM mean estimates corresponding to the positive and unconditional predictions, respectively, together with the one-step denoised sample [symbol garbled in source]. The positive and negative contrastive losses are:

[loss equations garbled in source]

with temperature hyperparameter $\tau$.

Guidance vectors are computed as follows [auxiliary definitions garbled in source]:

Positive CCFG: [equation garbled in source]
Negative CCFG: [equation garbled in source]

In one limit [condition garbled in source], both dynamics recover standard CFG; in the opposite limit, negative guidance smoothly turns off. The only change to the DDIM update is the dynamic guidance weight applied per time step.

The algorithm integrates into standard DDIM, with the substitution of the dynamically scaled guidance step prior to the denoising update.
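A rough sketch of how such a substitution could look in code. The deterministic DDIM step is the standard one; the weight function is a HYPOTHETICAL stand-in, not the paper's formula. It was chosen only because it equals 1 when the conditional and unconditional predictions coincide and lets the repulsion weight decay to 0 as they separate. All function names and the toy inputs are assumptions.

```python
import numpy as np

def contrastive_weight(eps_cond, eps_uncond, tau, negative):
    """HYPOTHETICAL scalar weight (illustrative assumption only).
    Positive: 2*sigmoid(z), ranging from 1 to 2.
    Negative: 2*sigmoid(-z), ranging from 1 to 0.
    Here z = tau * squared distance between the two predictions."""
    z = tau * np.sum((eps_cond - eps_uncond) ** 2)
    sign = 1.0 if negative else -1.0
    # 2 / (1 + exp(sign * z)), computed stably through logaddexp.
    return 2.0 * np.exp(-np.logaddexp(0.0, sign * z))

def guided_noise(eps_uncond, eps_cond, gamma, tau, negative=False):
    """CFG-style combination with the fixed scale gamma modulated per step."""
    w = contrastive_weight(eps_cond, eps_uncond, tau, negative)
    direction = -1.0 if negative else 1.0
    return eps_uncond + direction * gamma * w * (eps_cond - eps_uncond)

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (eta = 0)."""
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * x0_hat + np.sqrt(1.0 - alpha_bar_prev) * eps

# Toy usage: random arrays stand in for the two model predictions.
rng = np.random.default_rng(0)
x_t = rng.standard_normal(4)
eps_uncond = rng.standard_normal(4)
eps_neg = eps_uncond + 0.1 * rng.standard_normal(4)

eps = guided_noise(eps_uncond, eps_neg, gamma=3.0, tau=0.2, negative=True)
x_prev = ddim_step(x_t, eps, alpha_bar_t=0.5, alpha_bar_prev=0.6)
```

The only per-step addition relative to plain CFG is one squared-distance reduction and one scalar, which is consistent with the minimal overhead described in the text. Setting `tau=0.0` makes the weight exactly 1 and reduces the routine to standard or negated CFG.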

3. Empirical Evaluation and Comparison

CCFG was evaluated on class-conditional generation with MNIST and CIFAR-10 and on text-to-image generation with Stable Diffusion 1.5 using open-ended prompts (Chang et al., 2024). Key findings include:

Error Rate vs. FID: CCFG consistently outperforms naïve negated CFG and DNG across guidance scales, achieving lower forbidden-class error rates at equivalent or better FID.
Text-to-Image Negative Prompting: For prompts such as "remove yellow flower," "no strawberries," "no clouds," and "no fried egg," CCFG removes the undesired concept while preserving positive alignment and high visual quality; naïve negated CFG introduces severe artifacts and reduces alignment.
COCO-10k Benchmark: For large-scale negative prompting, CCFG achieves FID and positive-alignment scores statistically indistinguishable from baseline, while matching or exceeding the negation performance of prior methods.

Metrics include CLIP cosine similarity, ImageReward, HPS-v2 (human preference), and GPT-4-based visual concept grounding.

Method	Positive-Alignment (HPS-v2)	Negative-Alignment (GPT-4)	FID
None	0.265	0.301	19.62
nCFG	0.259	0.148	21.06
CCFG	0.265	0.153	19.96

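In this table, CCFG matches the unguided baseline on HPS-v2 (0.265) and stays within 0.34 FID of it (19.96 vs. 19.62), while roughly halving the negative-alignment score (0.153 vs. 0.301). nCFG reaches a slightly lower negative-alignment score (0.148) but at a worse FID (21.06) and lower HPS-v2 (0.259). A plausible implication is that CCFG preserves positive class semantics and image realism while achieving effective concept negation.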

4. Guidance Dynamics, Trade-Offs, and Hyperparameter Behavior

The weighting functions for attraction and repulsion enable CCFG to interpolate smoothly between full attraction, standard CFG, and zero guidance.

Guidance scale $\gamma$ controls the attraction/repulsion strength, while temperature $\tau$ sets the sharpness of the transition; a single fixed setting [value garbled in source] was found stable across all tasks tested. CCFG integrates with typical DDIM step schedules and adds only a scalar computation per step.
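To make the roles of the two hyperparameters concrete, the following worked example uses a hypothetical pair of weights. They are an assumption chosen only because they reproduce the qualitative behavior described above; $d_t$, $w^+$, and $w^-$ are illustrative symbols, not notation taken from the paper. For the negative case, $y$ is replaced by $y^-$ in $d_t$.

```latex
% Hypothetical weights (illustrative assumption, not the paper's definition)
d_t = \lVert \epsilon_\theta(x_t; y) - \epsilon_\theta(x_t; \varnothing) \rVert,
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}

w^+(d_t) = 2\,\sigma(\tau d_t^2), \qquad w^-(d_t) = 2\,\sigma(-\tau d_t^2)

% Guided predictions: \gamma sets the overall strength, the weight modulates it
\epsilon_t^{+} = \epsilon_\theta(x_t; \varnothing)
  + \gamma\, w^+(d_t)\,[\epsilon_\theta(x_t; y) - \epsilon_\theta(x_t; \varnothing)]
\epsilon_t^{-} = \epsilon_\theta(x_t; \varnothing)
  - \gamma\, w^-(d_t)\,[\epsilon_\theta(x_t; y^-) - \epsilon_\theta(x_t; \varnothing)]

% Limits
d_t \to 0 \ \text{or}\ \tau \to 0: \quad w^+ = w^- = 1
  \quad \text{(standard and negated CFG)}
d_t \to \infty: \quad w^+ \to 2, \qquad w^- \to 0

% Transition scale: w^- = 1/2 exactly when e^{\tau d_t^2} = 3
d_t^2 = \frac{\ln 3}{\tau}
```

Under this assumption, a larger $\tau$ moves the cutoff to smaller distances, so repulsion switches off sooner, while $\gamma$ rescales the whole guidance term without changing where the transition happens.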

CCFG offers robustness to overlapping and subtle negative concepts, outperforming DNG and nCFG, particularly when the concepts are not disjoint. When the positive and negative target distributions are highly entangled, however, the induced Gaussians become hard to separate and negation precision may drop.

5. Qualitative Analysis and Visual Case Studies

CCFG yields minimal distortion and selective concept removal in challenging edit scenarios. For instance:

In "photo of a flower" with negative "yellow flower," CCFG images depict flowers with altered colors while maintaining realism, whereas nCFG removes all color information.
For "airplane flying" with negative "cloud," CCFG cleanly removes clouds without eliminating sky texture; DNG fails to suppress faint cloud features.
Synthetic two-cluster tasks demonstrate that CCFG cleanly suppresses forbidden clusters with no off-manifold mass drift, as opposed to nCFG which displaces significant probability mass to implausible regions.
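The two-cluster case can be mimicked with a small score-space toy. This is a sketch under stated assumptions, not a reproduction of the paper's experiment: the data are an equal mixture of 1D Gaussians at -2 and +2, the cluster at +2 is forbidden, and the damped variant uses a HYPOTHETICAL weight `2 / (1 + exp(tau * diff**2))` as a stand-in for a contrastive weighting.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def gaussian_score(x, mean, std):
    return -(x - mean) / std**2

def mixture_score(x, means, std):
    """Score of an equal-weight Gaussian mixture."""
    pdfs = np.array([gaussian_pdf(x, m, std) for m in means])
    scores = np.array([gaussian_score(x, m, std) for m in means])
    return float(np.sum(pdfs * scores) / np.sum(pdfs))

def ncfg_score(s_uncond, s_neg, gamma):
    """Naive negated guidance in score form."""
    return s_uncond - gamma * (s_neg - s_uncond)

def damped_score(s_uncond, s_neg, gamma, tau):
    """Negated guidance with a hypothetical distance-dependent weight."""
    diff = s_neg - s_uncond
    w = 2.0 * np.exp(-np.logaddexp(0.0, tau * diff**2))  # assumed weight
    return s_uncond - gamma * w * diff

STD, GAMMA, TAU = 0.5, 3.0, 0.01

def scores_at(x):
    s_uncond = mixture_score(x, means=(-2.0, 2.0), std=STD)
    s_neg = gaussian_score(x, mean=2.0, std=STD)  # forbidden cluster at +2
    return ncfg_score(s_uncond, s_neg, GAMMA), damped_score(s_uncond, s_neg, GAMMA, TAU)

ncfg_far, damped_far = scores_at(-6.0)  # well beyond the allowed cluster
ncfg_mid, damped_mid = scores_at(0.0)   # halfway between the clusters
```

At the midpoint both variants push left, away from the forbidden cluster. Well beyond the allowed cluster, the naive score keeps pushing outward while the damped score points back toward the data, which is the qualitative contrast described for the synthetic task.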

These cases suggest that CCFG is better suited than nCFG or DNG to fine-grained and visually complex editing settings.

6. Implementation Notes and Limitations

CCFG is implemented as a modification to the sampling routine, introducing only a per-step computation of a scalar weighting factor. There is no requirement for additional model parameters, retraining, or external classifier modules.

The approach currently lacks a closed-form probabilistic interpretation analogous to the sharpened posterior induced by standard CFG; it is formalized as a guided sampling process targeting a contrastive NCE objective. When concept overlap is extreme and the negative and positive target distributions cannot be reliably separated, negation precision may degrade. No explicit margin parameter is provided; control relies on the existing hyperparameters (the guidance scale $\gamma$ and temperature $\tau$), which may require adjustment if under-scaling is observed at low values.

7. Future Directions

Future work includes deriving a probabilistic distribution that explains CCFG’s contrastive guidance as a global density, extending CCFG to multi-concept editing by integrating multi-way contrastive losses, and adapting the approach to modalities and inverse problems such as inpainting, super-resolution, or audio/video diffusion where selective feature removal is desirable. Adaptive scheduling for $\gamma$ or $\tau$ based on confidence or external criteria is a further line of exploration (Chang et al., 2024).


References

Chang et al. (2024). Contrastive CFG: Improving CFG in Diffusion Models by Contrasting Positive and Negative Concepts.
