DDIM Inversion for Novel View Synthesis
- DDIM inversion is a deterministic method that maps an image into a noisy latent whose mean preserves structural content and whose noise component encodes fine texture detail.
- TUNet leverages camera and scene embeddings for explicit geometric translation of the mean latent, facilitating view synthesis from a reference image.
- Fusion strategies combine the noise component with the translated latent to recover high-frequency textures, ensuring photorealistic novel views.
Synthesizing novel views from a single image is a fundamentally challenging problem that requires inferring the 3D geometry of a scene, completing occluded regions, and ensuring geometric consistency and texture fidelity across multiple viewpoints. Recent research leverages advances in diffusion models, particularly DDIM inversion, to address these challenges without retraining or multi-view supervision (Singh et al., 14 Aug 2025). The method provides a lightweight, explicit framework that fully exploits pretrained diffusion backbones for high-fidelity view translation.
1. DDIM Inversion for Latent Representation Extraction
DDIM inversion refers to deterministically mapping a synthesized or real image into a noisy latent $x_t$ at a chosen timestep $t$ by running the reverse DDIM process for $t$ steps. In this approach, the input image is first encoded into a latent $x_0$ (via the model's VAE) and then DDIM-inverted up to an intermediate noise level $t < T$. The resulting latent is decomposed into a mean ("signal") and variance ("noise") component, exploiting the fact that DDIM inversion preserves coarse structure in the mean while encoding fine-grained textures in the noise. The decomposition follows the diffusion forward relation

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon,$$

where the first term is the mean component and the second the noise component, and the practical reverse update (inversion iteration) is

$$x_{t+1} = \sqrt{\bar\alpha_{t+1}} \left( \frac{x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}} \right) + \sqrt{1-\bar\alpha_{t+1}}\, \epsilon_\theta(x_t, t),$$

where $\epsilon_\theta(x_t, t)$ is the predicted noise.
Rather than inverting all the way to pure noise, stopping at an intermediate step ensures the latent retains both semantic content and sufficient detail for view translation.
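A minimal sketch of this inversion and decomposition, assuming a generic `eps_model(x, t)` noise predictor and the scheduler's `alphas_cumprod` table (names and interfaces are illustrative, not the paper's code):

```python
def ddim_invert(x0, eps_model, alphas_cumprod, t_stop):
    """Deterministically invert a clean latent x0 up to timestep t_stop.

    eps_model(x, t) -> predicted noise stands in for the pretrained diffusion
    U-Net; alphas_cumprod is the scheduler's cumulative alpha-bar table.
    """
    x = x0
    for t in range(t_stop):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t + 1]
        eps = eps_model(x, t)
        # Clean-latent estimate implied by the current noisy latent.
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        # Deterministic DDIM update run "in reverse": toward higher noise.
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x

def decompose(x_t, x0, alphas_cumprod, t):
    """Split an inverted latent into its mean ('signal') and noise components
    via x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * eps."""
    a_t = alphas_cumprod[t]
    mean = a_t.sqrt() * x0                    # coarse structure
    noise = (x_t - mean) / (1 - a_t).sqrt()   # fine-grained texture
    return mean, noise
```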
2. Camera Pose-Conditioned Latent Translation (TUNet)
Transformation U-Net (TUNet) predicts the DDIM-inverted latent for the desired target view from the extracted mean component of the reference view. Conditioning includes:
- Relative and absolute camera intrinsics/extrinsics, implemented via a camera-pose embedding.
- A learned class embedding for the scene identifier.
- A temporal (diffusion timestep) embedding.
Conditioning vectors are concatenated or projected into the U-Net at downsampling, bottleneck, and upsampling stages. TUNet additionally employs a cross-attention mechanism to align feature maps conditioned on the ray embeddings of both reference and target camera poses, thus facilitating explicit geometric translation in the latent space.
The TUNet output approximates the low-frequency (structural) information for the target viewpoint, but on its own it typically lacks texture detail due to the spectral bias of diffusion models.
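The paper's exact TUNet architecture is not reproduced here; the following schematic stage illustrates the assumed conditioning pattern (embeddings added to feature maps plus cross-attention over ray embeddings), with all module and argument names hypothetical:

```python
import torch
import torch.nn as nn

class TUNetStage(nn.Module):
    """One schematic TUNet stage: conditioning vectors are projected and added
    to the feature map, and spatial features cross-attend to camera-ray
    embeddings of the reference and target views. dim is assumed to be
    divisible by the group and head counts."""
    def __init__(self, dim, cond_dim, n_heads=4):
        super().__init__()
        self.proj_cond = nn.Linear(cond_dim, dim)   # camera + class + timestep
        self.norm = nn.GroupNorm(8, dim)
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, h, cond, ray_tokens):
        # cond: (B, cond_dim) concatenated embeddings; ray_tokens: (B, S, dim).
        h = h + self.proj_cond(cond)[:, :, None, None]
        h = self.conv(torch.relu(self.norm(h)))
        b, c, height, width = h.shape
        q = h.flatten(2).transpose(1, 2)            # (B, H*W, C) queries
        attn_out, _ = self.cross_attn(q, ray_tokens, ray_tokens)
        return h + attn_out.transpose(1, 2).reshape(b, c, height, width)
```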
3. Fusion Strategy to Recover High-Frequency Detail
Empirical observation shows that the noise (variance) component of the DDIM-inverted reference latent encapsulates high-frequency texture information crucial for photorealistic synthesis. To inject this detail into the target-view latent, two fusion strategies are evaluated:
- Variance Fusion: the reference view's noise component is computed using the pretrained diffusion model's noise prediction, and the final initial latent is formed by recombining TUNet's translated mean with this noise according to the DDIM forward relation above.
- Direct Addition: the full DDIM-inverted latent of the reference view is added directly to the translated latent.
These strategies leverage the correlation structure of the DDIM inversion noise to transfer scene-specific texture and detail. This preserves textural fidelity that would otherwise be lost in low-frequency predictions.
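A sketch of both strategies, reusing the mean/noise decomposition from Section 1 and the same illustrative `eps_model` interface; the exact fusion coefficients in the paper may differ:

```python
def variance_fusion(mean_target, x_ref_t, eps_model, alphas_cumprod, t):
    """Recombine TUNet's translated mean with the reference view's noise,
    predicted by the pretrained diffusion model, using the forward relation
    x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * eps. Sketch only."""
    a_t = alphas_cumprod[t]
    eps_ref = eps_model(x_ref_t, t)                 # high-frequency detail
    return mean_target + (1 - a_t).sqrt() * eps_ref

def direct_addition(latent_target, x_ref_t):
    """Simplest variant: add the full DDIM-inverted reference latent to the
    translated target latent."""
    return latent_target + x_ref_t
```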
4. DDIM Sampling for Novel View Synthesis
The fused latent is used as the starting point for the DDIM generative process, run forward conditioned on the target camera pose and scene embedding. The pretrained diffusion model thus samples an image for the target view, leveraging its generative prior for high-fidelity, semantically plausible outputs. This allows the generation of novel views that maintain not only geometric layout (via TUNet’s output) but also realistic textures and surface details (via the noise fusion).
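A standard deterministic DDIM sampling loop starting from the fused latent might look as follows; the conditioning signature is an assumption for illustration:

```python
import torch

@torch.no_grad()
def ddim_sample(x_t, eps_model, alphas_cumprod, t_start, cond):
    """Deterministic DDIM sampling (eta = 0) from the fused latent at timestep
    t_start down to 0, with `cond` packing the target camera pose and scene
    embedding. The conditioning interface is illustrative."""
    x = x_t
    for t in range(t_start, 0, -1):
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        eps = eps_model(x, t, cond)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
    return x  # decode with the VAE to obtain the target-view image
```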
5. Experimental Results and Generalization
Extensive evaluation on the MVImgNet dataset demonstrates the effectiveness of this approach:
| Model or Method | LPIPS (↓) | FID (↓) | PSNR (↑) | SSIM (↑) |
|---|---|---|---|---|
| Ours (Fusion) | lower | lower | higher | higher |
| GIBR | higher | higher | lower | lower |
| NViST | higher | higher | lower | lower |
- On both 3-class and 167-class subsets, the method achieved lower LPIPS and FID scores, higher PSNR and SSIM, and preserved long-range geometric consistency.
- Qualitative analysis confirmed that reconstructions maintain surface geometry, complete occluded regions plausibly, and avoid the blurring common in prior works.
- The model also generalized to unseen out-of-domain images, maintaining structural validity and realism.
6. Advantages over Prior Methods and Limitations
Relative to prior methods that require multi-view training or large-scale diffusion backbone finetuning, this pipeline:
- Requires only a single input image and a pretrained diffusion model.
- Does not perform any backbone finetuning.
- Achieves sharp, high-resolution results with better preservation of scene details.
- Decouples coarse geometry recovery (mean latent translation) from texture transfer (noise fusion), exploiting the structural properties of the DDIM-inverted latent space.
A plausible implication is that this two-branch latent separation and fusion could serve as a pattern for similar translation tasks in other conditional diffusion settings.
Limitations include dependence on the pretrained diffusion model's domain priors and the assumption that the fused latent remains sufficiently expressive for extremely challenging out-of-distribution viewpoints.
7. Broader Significance and Future Directions
The use of DDIM inversion as a means of extracting a structured latent code—partitioned cleanly into mean and variance—enables explicit translation and detail-preserving synthesis in a plug-and-play manner. This framework circumvents many computational barriers to novel view synthesis, reduces engineering overhead, and demonstrates the utility of diffusion priors for 3D-aware generation.
Future research may explore:
- More sophisticated fusion algorithms (e.g., learned blending masks, view-dependent attention).
- Uncertainty estimation or adaptive fusion, especially in generative completion for occluded areas.
- Extensions to time-sequential data (video) or multi-modal settings (e.g., cross-domain transfer).
In conclusion, this approach establishes DDIM inversion as a principled foundation for high-fidelity, geometry-aware, and texture-consistent novel view synthesis using pretrained diffusion models (Singh et al., 14 Aug 2025).