Image-Conditioned Diffusion Policies
- Image-conditioned diffusion policies are frameworks that integrate image-derived cues into diffusion models to guide both synthesis and decision-making processes.
- They utilize innovative conditioning mechanisms, trajectory-shifting techniques, and adaptive noise schedules to embed semantic and spatial information throughout the diffusion process.
- These policies find practical applications in image editing, 3D reconstruction, robotic manipulation, and real-time inpainting, enhancing both fidelity and robustness.
Image-conditioned diffusion policies constitute a class of generative and decision-making frameworks wherein diffusion models are conditioned on visual information—such as images, spatial embeddings, or image-derived representations—to control the synthesis or action-selection processes. By integrating conditioning signals derived from image data, these policies enhance controllability, semantic alignment, and adaptability in high-dimensional vision-driven applications, ranging from image generation and editing to robotic manipulation and 3D reconstruction. Recent developments introduce advanced mechanisms for embedding, factorizing, and adaptively regulating policy conditioning, drawing on innovations in joint embedding spaces, trajectory-shifting formulations, selective guidance, and architectural modularity.
1. Conditioning Mechanisms for Diffusion Models
Traditional diffusion models begin generation from random Gaussian noise, imposing conditioning primarily during the reverse denoising phase. Recent research introduces alternative schemes in which image-derived information is injected directly into the initial noise or diffused across all timesteps for enhanced control. A notable example is the use of “object saliency” noise constructed via Inverting Gradients (IG), which encodes semantic and localization cues in the initial noise (Singh et al., 2022). The IG procedure optimizes the input noise $x_T$ so that its classifier-loss gradients closely align (in cosine similarity) with those of a target semantic image $x_{\text{ref}}$, thereby embedding spatial priors and semantic information directly into the diffusion process; schematically,

$$x_T^{\star} = \arg\max_{x_T} \; \cos\!\left(\nabla_{x_T}\,\mathcal{L}\big(f(x_T), y\big),\; \nabla_{x_{\text{ref}}}\,\mathcal{L}\big(f(x_{\text{ref}}), y\big)\right),$$

where $f$ is a classifier and $\mathcal{L}$ its loss for target class $y$.
This method contrasts with approaches that rely on classifier guidance at inference, shifting conditional control to the earliest steps of diffusion.
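The following sketch illustrates the gradient-alignment idea in PyTorch. It is a minimal, schematic version, not the authors' implementation: `classifier`, `ref_image`, and `target_class` are placeholders, and the optimizer settings are illustrative.

```python
import torch
import torch.nn.functional as F

def inverting_gradients_noise(classifier, ref_image, target_class,
                              steps=200, lr=0.05):
    """Optimize an initial-noise tensor so its classifier-loss gradients
    align (in cosine similarity) with those of a reference image."""
    # Fixed target: gradient of the classifier loss w.r.t. the reference image.
    ref = ref_image.clone().requires_grad_(True)
    g_ref = torch.autograd.grad(
        F.cross_entropy(classifier(ref), target_class), ref)[0].detach()

    noise = torch.randn_like(ref_image).requires_grad_(True)
    opt = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(classifier(noise), target_class)
        g_noise = torch.autograd.grad(loss, noise, create_graph=True)[0]
        # Maximize cosine similarity between the two gradient fields.
        align = F.cosine_similarity(g_noise.flatten(1), g_ref.flatten(1)).mean()
        (-align).backward()
        opt.step()
    return noise.detach()
```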
In policy domains, joint image-shape embedding spaces are constructed (e.g., using transformer encoders, contrastive objectives) to enable flexible conditioning. In IC3D, image embeddings steer voxelized 3D shape generation via dual pathways—direct projection to timestep encodings and token injection into attention modules—combined with classifier-free guidance (Sbrolli et al., 2022).
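A minimal sketch of the dual conditioning pathways described above, assuming an image embedding of size `emb_dim` and a denoiser whose blocks consume a timestep encoding plus condition tokens; the module and parameter names are illustrative, and classifier-free guidance is obtained by randomly dropping the embedding during training.

```python
import torch.nn as nn

class DualPathCondition(nn.Module):
    """Two pathways for an image embedding: (a) projection added to the
    timestep encoding, (b) tokens injected into attention modules."""
    def __init__(self, emb_dim, time_dim, n_tokens=4):
        super().__init__()
        self.to_time = nn.Linear(emb_dim, time_dim)               # pathway (a)
        self.to_tokens = nn.Linear(emb_dim, n_tokens * emb_dim)   # pathway (b)
        self.n_tokens = n_tokens

    def forward(self, img_emb, t_emb):
        t_cond = t_emb + self.to_time(img_emb)
        tokens = self.to_tokens(img_emb).view(img_emb.size(0), self.n_tokens, -1)
        return t_cond, tokens  # consumed by the denoiser's blocks
```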
2. Trajectory Shifting and Factorization: Forward-Process Condition Integration
Beyond conditioning solely during denoising, trajectory-shifting approaches integrate conditional signals throughout the forward diffusion process. ShiftDDPMs modify the forward kernel to embed conditions (image embeddings, class labels, attributes) as trajectory shifts, yielding exclusive latent paths for each condition (Zhang et al., 2023); a representative shifted kernel takes the form

$$q(x_t \mid x_0, c) = \mathcal{N}\!\big(x_t;\; \sqrt{\bar{\alpha}_t}\,x_0 + k_t\,E(c),\; (1-\bar{\alpha}_t)\,I\big),$$

where $E(c)$ embeds the condition and $k_t$ is a shift schedule.
Varying schedules (e.g., prior-shift, quadratic-shift) and shift predictors enable fine-grained trajectory disentanglement. This mechanism disperses semantic information across all timesteps, facilitating robust conditional modeling and controlled transformations such as attribute blending and domain transitions.
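A minimal sampler for such a shifted forward kernel, assuming the condition embedding `cond_emb` has been broadcast to the shape of `x0` and that `alpha_bar` and `k` are precomputed per-timestep schedules; the names are illustrative.

```python
import torch

def shifted_forward_sample(x0, cond_emb, t, alpha_bar, k):
    """Sample x_t from a condition-shifted forward kernel: the condition
    embedding displaces the trajectory mean at every timestep."""
    noise = torch.randn_like(x0)
    expand = lambda v: v[t].view(-1, *([1] * (x0.dim() - 1)))
    a_bar, k_t = expand(alpha_bar), expand(k)
    return a_bar.sqrt() * x0 + k_t * cond_emb + (1.0 - a_bar).sqrt() * noise
```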
Factorized Diffusion Policies (FDP) extend this principle to multimodal robot observation data by decomposing the policy score into a prioritized base stream and a residual component, thus explicitly modeling modality prioritization (e.g., vision > tactile) (Patil et al., 2025). Via Bayes' rule,

$$\nabla_a \log p(a \mid o_{\text{vis}}, o_{\text{tac}}) = \nabla_a \log p(a \mid o_{\text{vis}}) + \nabla_a \log p(o_{\text{tac}} \mid a, o_{\text{vis}}).$$
| Policy Variant | Conditioning Streams | Main Advantage |
|---|---|---|
| Standard DP | Joint (all modalities) | Simplicity |
| FDP (base + residual) | Prioritized + corrective residual | Robustness, safety |
This factorization yields improved sample efficiency and resilience under distributional shifts; a minimal composition sketch follows.
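The composition is straightforward once the two streams are trained; the sketch below assumes placeholder score networks `base_model` and `residual_model` with the shown signatures.

```python
def factored_score(base_model, residual_model, a_t, t, obs_vis, obs_tac):
    """Compose the policy score from a prioritized base stream and a
    corrective residual, mirroring the Bayes factorization above."""
    score_base = base_model(a_t, t, obs_vis)                # vision-only stream
    score_resid = residual_model(a_t, t, obs_vis, obs_tac)  # corrective residual
    return score_base + score_resid
```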
3. Selective Guidance, Prior Decoupling, and Late Constraints
To address the trade-off between noise-driven semantic manipulation and fidelity preservation, approaches such as Selective Diffusion Distillation (SDD) distill semantic priors from diffusion models into feedforward manipulator networks. By computing Hybrid Quality Score (HQS) indicators, based on the entropy and gradient magnitude of diffusion-based semantic guidance, SDD selects optimal timesteps for distillation (Wang et al., 2023). This reconciles editability and fidelity, enabling high-quality manipulation in a single forward pass.
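A schematic HQS-style selector is sketched below; the exact scoring formula in the paper may differ, and `probs` (semantic prediction probabilities) and `guidance_grad` (the guidance gradient at a given timestep) are placeholders.

```python
import torch

def hybrid_quality_score(probs, guidance_grad, eps=1e-8):
    """One plausible entropy/gradient-magnitude combination: favor timesteps
    with confident semantics (low entropy) and a strong guidance signal."""
    entropy = -(probs * (probs + eps).log()).sum(dim=-1).mean()
    grad_mag = guidance_grad.flatten(1).norm(dim=1).mean()
    return grad_mag / (entropy + eps)

def select_timestep(scores_by_t):
    """Pick the timestep with the best score from a {t: score} mapping."""
    return max(scores_by_t, key=scores_by_t.get)
```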
LaCon introduces late-constraint guidance, where a lightweight Condition Adapter aligns external structural cues (e.g., edges, masks, palettes) with internal diffusion features. The methodology leverages a score composition of the schematic form

$$\hat{\epsilon} = \epsilon_\theta(x_t, t) + s\,\nabla_{x_t}\,\mathcal{D}(x_t, c),$$

where $\nabla_{x_t}\,\mathcal{D}(x_t, c)$ encodes the gradient of the condition discrepancy and $s$ is a guidance scale. LaCon demonstrates successful plug-and-play control and strong generalization across semantic and structural condition types (Liu et al., 2023).
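The sketch below implements this composition; for brevity the adapter is applied to the model's prediction rather than to internal features, and `eps_theta`, `adapter`, and `cond` are placeholders.

```python
import torch

def late_constraint_step(eps_theta, x_t, t, adapter, cond, scale=1.0):
    """Correct the base noise prediction with the gradient of a
    condition-discrepancy term produced by a lightweight adapter."""
    x = x_t.detach().requires_grad_(True)
    eps = eps_theta(x, t)
    discrepancy = adapter(eps, cond)            # scalar discrepancy to the cue
    grad = torch.autograd.grad(discrepancy, x)[0]
    return eps.detach() + scale * grad
```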
In diffusion-prior frameworks, an intermediate shared representation (a CLIP embedding) decouples semantic alignment from pixel-level generation, simplifying domain- and color-conditioned synthesis while conserving computational resources (Aggarwal et al., 2023).
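The decoupling reduces to a two-stage pipeline, sketched below with placeholder callables: a prior maps the condition into the shared embedding space, and a condition-agnostic decoder renders pixels from it.

```python
def two_stage_generate(prior, decoder, condition):
    """Decoupled synthesis: semantic alignment happens in the prior,
    pixel-level generation in the decoder."""
    clip_emb = prior(condition)   # condition -> shared CLIP embedding
    return decoder(clip_emb)      # embedding -> image, agnostic to condition type
```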
4. Adaptive Control, Efficiency, and Robustness
Diffusion model generation has traditionally relied on fixed-length sampling and fixed noise schedules. Adaptively Controllable Diffusion Models (AC-Diff) introduce modules for condition-centric process control (Xing et al., 2024):
- Conditional Time-Step (CTS): Estimates the number of denoising steps needed per sample from fused text and image embeddings, modulated by complexity metrics (e.g., spatial entropy).
- Adaptive Hybrid Noise Schedule (AHNS): Recomputes the noise schedule per sample by combining learned, sample-dependent coefficients with the base schedule.
Such adaptive scheduling reduces unnecessary computation while maintaining or improving fidelity and semantic accuracy; a sketch of the per-sample step-budget idea follows.
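This sketch is a loose illustration of CTS-style budgeting, not the AC-Diff formulation: the fusion of embeddings, the entropy proxy, and the weighting are all assumptions.

```python
import torch
import torch.nn.functional as F

def spatial_entropy(image, bins=32):
    """Histogram entropy of pixel intensities (assumed in [0, 1]),
    a simple complexity proxy."""
    hist = torch.histc(image, bins=bins, min=0.0, max=1.0)
    p = hist / hist.sum().clamp_min(1.0)
    p = p[p > 0]
    return -(p * p.log()).sum()

def conditional_time_steps(text_emb, img_emb, image, t_min=10, t_max=50):
    """Scale the sampling budget by a complexity score fused from the
    conditioning embeddings and the image's spatial entropy."""
    agreement = F.cosine_similarity(text_emb, img_emb, dim=-1).mean()
    complexity = spatial_entropy(image) / torch.log(torch.tensor(32.0))
    score = 0.5 * (1.0 - agreement) + 0.5 * complexity
    return int(t_min + score.clamp(0, 1).item() * (t_max - t_min))
```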
In action-policy domains, findings indicate that diffusion policies trained on limited data largely memorize, behaving as action lookup tables that retrieve the nearest memorized action for an observed image rather than generalizing (He et al., 2025). The explicit Action Lookup Table (ALT) matches diffusion-model performance while drastically reducing inference time and resource load. ALT's OOD detection via latent-similarity thresholds adds runtime safety, pertinent for robotics applications.
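A compact version of this retrieval scheme, assuming precomputed observation latents and their paired actions; the threshold value and normalization choice are illustrative.

```python
import torch
import torch.nn.functional as F

class ActionLookupTable:
    """Nearest-neighbor action retrieval over memorized (latent, action)
    pairs, with a cosine-similarity threshold as a runtime OOD trigger."""
    def __init__(self, latents, actions, ood_threshold=0.8):
        self.latents = F.normalize(latents, dim=-1)  # (N, D)
        self.actions = actions                       # (N, action_dim)
        self.ood_threshold = ood_threshold

    def __call__(self, obs_latent):
        q = F.normalize(obs_latent, dim=-1)          # (D,)
        sims = self.latents @ q                      # cosine similarities, (N,)
        best = sims.argmax()
        if sims[best] < self.ood_threshold:
            return None                              # OOD: defer to a fallback
        return self.actions[best]
```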
5. Expanding to 3D, Real-time Synthesis, and Inpainting
Image-conditioned diffusion policies are being extended beyond standard pixel synthesis to complex spatial and physical generative tasks:
- 3D Shape Generation and Human Reconstruction: IC3D and SiTH demonstrate the use of joint image-shape embeddings and adapted latent diffusion models to generate textured 3D shapes and human meshes from single or partial views, handling occlusions and out-of-distribution inputs robustly (Sbrolli et al., 2022, Ho et al., 2023).
- Real-time Image Inpainting and Virtual Try-All: Methods such as Diffuse to Choose (DTC) integrate pixel-level cues from reference items with FiLM-modulated auxiliary UNet branches and perceptual loss terms to efficiently insert products into user scenes, supporting zero-shot virtual try-on applications (Seyfioglu et al., 2024); see the FiLM sketch after this list.
- Learning-free Customization: The Cyclic One-Way Diffusion (COW) method preserves fine-grained visual conditions by cyclically injecting and reconstructing latent representations, enabling flexible and efficient control for style transfer, landmark editing, and inpainting (Wang et al., 2023).
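FiLM itself is a standard mechanism; a minimal module is sketched below, with the reference-item features standing in for the conditioning signal (its placement inside DTC's auxiliary UNet branches is simplified away here).

```python
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: a conditioning vector produces
    per-channel scale and shift applied to a feature map."""
    def __init__(self, cond_dim, num_channels):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, features, cond):
        # features: (B, C, H, W), cond: (B, cond_dim)
        gamma, beta = self.to_scale_shift(cond).chunk(2, dim=-1)
        gamma = gamma[..., None, None]  # broadcast over H, W
        beta = beta[..., None, None]
        return (1.0 + gamma) * features + beta
```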
6. Mathematical Formulations, Performance Metrics, and Implementation Aspects
Key mathematical mechanisms across these approaches include:
- Forward process modification with additive or multiplicative conditional shifts in latent space (Zhang et al., 2023).
- Score-based diffusion representations for policy learning in imitation and goal-conditioned behavior (Reuss et al., 2023).
- Loss formulations that decouple prediction of the structural (image-to-zero) component from the stochastic (zero-to-noise) component, with separate training objectives for each mapping.
Performance is evaluated using FID, CLIP alignment, human evaluation, success rates on manipulation benchmarks, Chamfer Distance and normal consistency (for 3D), and real-time inference latency. Notable results include 15–40% improvements in success rates from modality prioritization (Patil et al., 2025), near-perfect identity retention and efficient customization with cyclic strategies (Wang et al., 2023), and real-world deployment feasibility enabled by fast, adaptive sampling (Xing et al., 2024).
7. Application Scenarios and Practical Impact
Image-conditioned diffusion policies are actively shaping diverse application domains:
- Image generation and editing: Steerable synthesis, color and stylistic control, plug-and-play editing (Liu et al., 2023, Aggarwal et al., 2023).
- Robot manipulation: Policy robustness under observation noise, sample-efficient skill learning, explicit memorization or OOD safety triggers (He et al., 2025, Patil et al., 2025).
- 3D modeling and inpainting: Textured mesh synthesis from single views, highly realistic scene completion (Ho et al., 2023).
- E-commerce and real-time visualization: Zero-shot inpainting for virtual try-on, personalization, efficient integration into user workflows (Seyfioglu et al., 2024).
The field continues to evolve with advances in adaptive process control, modular architectural design, efficient priors, and robust multimodal conditioning, pushing the boundaries of fidelity, controllability, and application breadth in vision-guided generative modeling and policy learning.