TUNet: Camera Pose-Conditioned Translation U-Net
- The paper introduces TUNet, a U-Net architecture that conditions on camera pose to predict DDIM-inverted latents for novel view synthesis.
- Multi-stage conditioning with cross-attention and ray embeddings enables precise geometric translation and effective noise fusion to preserve fine details.
- Evaluations on MVImgNet show that TUNet outperforms previous methods with improved LPIPS, PSNR, SSIM, and FID metrics, producing photorealistic, coherent outputs.
A Camera Pose-Conditioned Translation U-Net (TUNet) is an encoder–decoder neural architecture designed to perform geometric translation or latent transformation between different camera viewpoints, explicitly conditioned on camera pose. In the context of novel view synthesis, TUNet predicts the DDIM-inverted latent corresponding to a desired target view from the latent of an input (reference) image, with camera pose parameters injected as conditioning at multiple stages. This architecture provides explicit geometric grounding for the translation, facilitating high-fidelity generation of new views that are both photorealistic and geometrically consistent with the specified camera transformation, especially when integrated with powerful pretrained image diffusion models (Singh et al., 14 Aug 2025).
1. TUNet Architectural Design
TUNet employs a U-Net–like encoder–decoder backbone that operates on the DDIM-inverted mean latent of a reference image. The key architectural differences from classical U-Nets are as follows:
- Multi-stage Conditioning: At each stage (downsample, mid, and upsample blocks), the intermediate feature maps are conditioned on camera intrinsics/extrinsics, scene class, and time embedding. The camera parameters are first projected via a learned linear layer, combined with class and time embeddings, then spatially broadcast and added channel-wise to features.
- Cross-Attention Mechanisms: Mid and up blocks integrate cross-attention modules. Here, ray embeddings derived from both the reference view (serving as keys/values) and the target view (serving as queries) mediate the fusion, allowing geometry-aware transfer of information between viewpoints.
- Latent-to-Latent Translation: TUNet predicts the DDIM-inverted latent mean for the target pose, not a full image or a denoised latent. This design requires the network to learn the complex mapping that models geometric shifts between camera viewpoints.
This structure is kept lightweight and is specifically adapted for use with pretrained diffusion models, such as Latent Diffusion Models (LDMs), serving as the generative prior.
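The sketch below illustrates one such conditioned stage in PyTorch: a channel-wise broadcast of the pose/class/time conditioning followed by cross-attention in which target-view ray embeddings act as queries and reference-view ray embeddings as keys/values. Tensor shapes, head counts, and layer choices are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch (PyTorch) of one TUNet conditioned stage, under assumed shapes.
import torch
import torch.nn as nn

class ConditionedBlock(nn.Module):
    """One stage: broadcast-add pose/class/time conditioning, then cross-attend
    target-ray queries against reference-ray keys/values."""
    def __init__(self, channels: int, cond_dim: int, ray_dim: int):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, channels)   # camera + class + time -> channels
        self.to_q = nn.Linear(ray_dim, channels)         # target-view ray embeddings -> queries
        self.to_kv = nn.Linear(ray_dim, 2 * channels)    # reference-view ray embeddings -> keys/values
        self.attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)

    def forward(self, feats, cond, ray_ref, ray_tgt):
        # feats: (B, C, H, W); cond: (B, cond_dim); rays: (B, H*W, ray_dim)
        b, c, h, w = feats.shape
        # 1) channel-wise conditioning, spatially broadcast over the feature map
        feats = feats + self.cond_proj(cond)[:, :, None, None]
        # 2) geometry-aware cross-attention mediated by the two views' ray embeddings
        q = self.to_q(ray_tgt)
        k, v = self.to_kv(ray_ref).chunk(2, dim=-1)
        ctx, _ = self.attn(q, k, v)                       # (B, H*W, C)
        return feats + ctx.transpose(1, 2).reshape(b, c, h, w)
```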
2. Explicit Integration of Camera Pose
TUNet uses both camera intrinsics and extrinsics in its conditioning pipeline. These are preprocessed (e.g., concatenated and projected), combined with a class embedding and the current DDIM timestep (capturing latent diffusion progress), and periodically re-injected into the network's activations. Cross-attention is further enriched by ray embeddings computed from camera parameters, enabling correspondence across 3D viewing rays for the reference and target positions.
This explicit camera-pose conditioning allows TUNet to infer the appropriate geometric transformation in latent space, ensuring that the predicted latent for the target view is consistent with the desired pose transformation.
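A minimal sketch of how such conditioning inputs might be assembled is shown below, assuming flattened 3×3 intrinsics and 3×4 extrinsics, a sinusoidal timestep embedding, and per-pixel origin-plus-direction rays; the exact parameterization and embedding sizes are assumptions, not taken from the paper.

```python
# Hypothetical assembly of the pose/class/time conditioning vector and ray embeddings.
import math
import torch
import torch.nn as nn

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    # Standard sinusoidal embedding of the DDIM timestep (dim assumed even).
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class PoseConditioner(nn.Module):
    def __init__(self, num_classes: int, cond_dim: int = 256):
        super().__init__()
        # flattened intrinsics (3x3 = 9) + extrinsics (3x4 = 12) -> cond_dim
        self.cam_proj = nn.Linear(9 + 12, cond_dim)
        self.cls_emb = nn.Embedding(num_classes, cond_dim)
        self.time_dim = cond_dim

    def forward(self, K, E, cls, t):
        cam = torch.cat([K.flatten(1), E.flatten(1)], dim=-1)   # (B, 21)
        return self.cam_proj(cam) + self.cls_emb(cls) + timestep_embedding(t, self.time_dim)

def ray_embeddings(K, E, h, w):
    # Per-pixel world-space rays (origin + direction), a common choice for
    # pose-aware attention; E is assumed world-to-camera. Returns (B, H*W, 6).
    B = K.shape[0]
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()       # (H, W, 3)
    dirs_cam = torch.einsum("bij,hwj->bhwi", torch.inverse(K), pix)        # camera-frame directions
    R, tvec = E[:, :3, :3], E[:, :3, 3]
    dirs = torch.einsum("bij,bhwj->bhwi", R.transpose(1, 2), dirs_cam)     # world-frame directions
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origins = (-torch.einsum("bij,bj->bi", R.transpose(1, 2), tvec))[:, None, None, :].expand_as(dirs)
    return torch.cat([origins, dirs], dim=-1).reshape(B, h * w, 6)
```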
3. DDIM Inversion and Two-Part Latent Representation
Novel view synthesis with TUNet begins by projecting the input image into latent space with the pretrained VAE encoder, followed by DDIM inversion up to an intermediate timestep. The inverted latent is split into two components:
- Signal (Mean) Component: coarse geometry and low-frequency content.
- Noise (Variance) Component: high-frequency details, textures, and edge information.
The mean component serves as the primary input to TUNet, which, conditioned on the target pose, predicts the corresponding mean latent of the target view.
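A minimal sketch of this split follows, using the standard diffusion decomposition z_t = sqrt(a_t)·ẑ_0 + sqrt(1 − a_t)·ε̂ with a pretrained noise-prediction U-Net; the exact split and call signature used by TUNet are assumptions here.

```python
# Illustrative split of a DDIM-inverted latent into signal (mean) and noise parts.
import torch

@torch.no_grad()
def split_latent(unet, z_t, t, alphas_cumprod, cond):
    a_t = alphas_cumprod[t]
    eps_hat = unet(z_t, t, cond)                               # predicted noise at timestep t (assumed signature)
    z0_hat = (z_t - torch.sqrt(1 - a_t) * eps_hat) / torch.sqrt(a_t)
    mean_part = torch.sqrt(a_t) * z0_hat                       # low-frequency / signal content
    noise_part = torch.sqrt(1 - a_t) * eps_hat                 # high-frequency / detail content
    return mean_part, noise_part                               # mean_part + noise_part == z_t
```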
4. Noise Fusion Strategy for Detail Preservation
Naive translation of the signal component alone accumulates blur, a common artifact caused by the spectral bias of DDIM inversion and by the loss of high-frequency correlation during translation. TUNet employs two complementary fusion strategies:
- Variance Fusion (Strategy A): After predicting the target mean latent, the noise (variance) component extracted from the reference image's DDIM inversion is added to it, directly blending high-frequency content from the observed view into the target latent.
- Direct Noise Addition (Strategy B): The full DDIM-inverted latent of the reference image is added, with a scaling factor, to the predicted target mean latent.
Both strategies exploit the noise correlation structure inherent to DDIM inversion to transfer texture and fine detail from the source view to the target view.
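A minimal sketch of the two strategies appears below; the scaling coefficients are hypothetical hyperparameters, not values reported in the paper.

```python
# Illustrative noise-fusion strategies on latent tensors of identical shape.
import torch

def fuse_variance(mean_tgt_pred, noise_ref, scale: float = 1.0):
    # Strategy A: add the reference view's noise (variance) component
    # to the predicted target mean latent.
    return mean_tgt_pred + scale * noise_ref

def fuse_full_latent(mean_tgt_pred, z_ref_inverted, scale: float = 0.5):
    # Strategy B: add the full DDIM-inverted reference latent, scaled,
    # to the predicted target mean latent.
    return mean_tgt_pred + scale * z_ref_inverted
```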
5. DDIM Sampling and Pretrained Generative Priors
The fused target latent acts as the initial condition for subsequent DDIM sampling through a pretrained diffusion model's U-Net and VAE decoder. This approach leverages the strong generative prior and learned distribution of natural images encoded in the latent diffusion model, which enables:
- Recovery of high-frequency details otherwise missing in pure translation approaches.
- Enforcing photorealism and global coherence via the diffusion denoising pathway.
By exploiting this prior, TUNet generates novel views that are not only aligned with the desired target pose but also exhibit natural image statistics and semantic consistency.
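The overall inference path can be sketched as below, reusing split_latent and fuse_variance from the earlier snippets; `vae`, `ldm_unet`, `ddim`, and `tunet` are hypothetical handles to a pretrained latent diffusion model and the trained TUNet, and all names and signatures are illustrative assumptions.

```python
# Illustrative end-to-end inference pipeline for single-image novel view synthesis.
import torch

@torch.no_grad()
def synthesize_novel_view(image, cam_ref, cam_tgt, cls, vae, ldm_unet, ddim, tunet, t_mid):
    z0 = vae.encode(image)                                     # reference image -> latent
    z_t = ddim.invert(ldm_unet, z0, num_steps=t_mid)           # DDIM inversion to an intermediate timestep
    mean_ref, noise_ref = split_latent(ldm_unet, z_t, t_mid, ddim.alphas_cumprod, None)
    mean_tgt = tunet(mean_ref, cam_ref, cam_tgt, cls, t_mid)   # pose-conditioned latent translation
    z_tgt = fuse_variance(mean_tgt, noise_ref)                 # noise fusion (Strategy A)
    z0_tgt = ddim.sample(ldm_unet, z_tgt, start_step=t_mid)    # denoise with the pretrained prior
    return vae.decode(z0_tgt)                                  # decoded target-view image
```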
6. Quantitative and Qualitative Performance
Comprehensive experiments on MVImgNet demonstrate that TUNet, in conjunction with the DDIM inversion and noise fusion strategies, outperforms existing single-image novel view synthesis methods. On the 3-class split at 256×256 resolution, TUNet achieves an LPIPS of 0.490 (GIBR baseline: 0.510). On the 167-class split at 90×90, it produces LPIPS of 0.409, PSNR of 16.16, SSIM of 0.578, and FID of 65.50—surpassing NViST (LPIPS 0.448, PSNR 14.31, SSIM 0.566, FID 91.63). Qualitative analysis confirms high-fidelity synthesis both in short and long-range view transformations, with detailed reconstruction of occluded and out-of-distribution regions.
7. Implications and Extensions
The Camera Pose-Conditioned Translation U-Net paradigm demonstrates that explicit camera-pose conditioning and careful latent manipulation (with noise fusion) enable competitive, robust novel view synthesis from a single image, without training or fine-tuning large diffusion models. The architecture is readily extendable to other pose-conditioned image translation tasks, provided a suitable generative prior and a means of extracting pose-aligned latent representations. The explicit focus on geometric consistency, cross-attention with ray embeddings, and leveraging of pretrained diffusion priors distinguish TUNet from earlier pose-conditioned networks and endow it with improved generalization and sample quality (Singh et al., 14 Aug 2025).