Camera Pose-Conditioned Latent Translation (TUNet)
- The paper introduces a novel view synthesis framework that leverages DDIM inversion and explicit camera pose conditioning to guide latent translation.
- It employs cross-attention modules and fusion strategies, including variance fusion and direct noise addition, to preserve texture and high-frequency details.
- Empirical results on MVImgNet demonstrate significant improvements in LPIPS, FID, PSNR, and SSIM, indicating robust generalization across unseen scenes.
Camera Pose-Conditioned Latent Translation (TUNet) is an architectural and methodological framework for novel view synthesis that leverages learned latent-space translation conditioned explicitly on camera parameters. The paradigm addresses the long-standing challenges of high-fidelity viewpoint generation, 3D scene consistency, and geometric transfer in generative models by operating in the latent domain of pre-trained diffusion models, with explicit pose conditioning and fusion strategies to preserve texture and detail. TUNet’s key innovations are the integration of DDIM inversion, pose-aware architectural conditioning, and fusion of latent statistics for image reconstruction, enabling efficient and accurate generation of novel views from a single reference viewpoint without retraining the backbone diffusion model (Singh et al., 14 Aug 2025).
1. Foundation: Latent Translation via DDIM Inversion
Central to TUNet is the use of DDIM inversion, a deterministic diffusion process that encodes a clean image latent $z_0$ into a noisy code $z_{t^*}$ after $t^*$ forward steps:

$$z_{t^*} = \sqrt{\bar{\alpha}_{t^*}}\, z_0 + \sqrt{1-\bar{\alpha}_{t^*}}\,\epsilon,$$

where $\bar{\alpha}_{t^*}$ denotes the cumulative noise schedule and $\epsilon \sim \mathcal{N}(0, I)$ represents Gaussian noise. This latent, taken at an intermediate time $t^*$, encapsulates compressed semantic information and preserves both low- and high-frequency components of the original image.
TUNet learns a mapping from $z_{t^*}^{\text{ref}}$ (the DDIM-inverted latent of the reference view) to $\hat{\mu}_{t^*}^{\text{tgt}}$ (the predicted mean latent for the desired target viewpoint), conditioned on camera intrinsics/extrinsics and scene class. This latent translation forms the backbone of view synthesis, where the pose conditioning steers the generative process in the correct geometric direction.
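A minimal sketch of this inversion-then-translation interface is given below, assuming a standard DDPM noise schedule; the names `ddim_invert_forward`, `tunet`, and the conditioning arguments are illustrative assumptions, not the paper's API.

```python
import torch

def ddim_invert_forward(z0, t_star, alphas_cumprod, eps=None):
    """Noise a clean VAE latent z0 up to timestep t_star (forward view of DDIM inversion)."""
    if eps is None:
        eps = torch.randn_like(z0)
    a_bar = alphas_cumprod[t_star]
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps
    return z_t, eps

# Reference view -> inverted latent -> pose-conditioned translation to the target view.
# `tunet` is assumed to take the reference latent plus camera/class/timestep conditioning
# and return the predicted mean latent for the target viewpoint:
#   z_ref_t, eps_ref = ddim_invert_forward(z_ref_0, t_star, alphas_cumprod)
#   mu_tgt_t = tunet(z_ref_t, cam_ref, cam_tgt, class_id, t_star)
```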
2. Architectural Components and Pose Conditioning
TUNet is realized as a U-Net with an encoder–decoder topology, augmented with cross-attention modules and multi-level conditioning (Singh et al., 14 Aug 2025). The architectural elements include:
- Encoder: Processes the DDIM-inverted latent from the reference view, extracting hierarchical features.
- Conditioning on Pose and Class: Camera parameters (intrinsics and extrinsics) are embedded via a linear layer, while the scene-class embedding and diffusion-timestep embedding are concatenated with it and injected into each ResNet block.
- Cross-Attention: Mid and up blocks utilize queries constructed from target ray embeddings (standard NeRF ray parameterization) and current feature maps, while keys/values incorporate reference-view ray statistics and latent codes. The attention follows the standard scaled dot-product form:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

This mechanism ensures geometric transfer and effective conditioning on the camera pose.
- Output: The decoder outputs the predicted mean latent $\hat{\mu}_{t^*}^{\text{tgt}}$, representing the coarse prediction for the target view.
This design maintains the generative prior of the Stable Diffusion model and incorporates explicit geometric control; a minimal sketch of the conditioning and cross-attention path is given below.
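The sketch below illustrates one plausible realization of this conditioning path and the ray-based cross-attention; the embedding dimensions, module names, and the additive fusion of ray embeddings with feature maps are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

POSE_DIM, CLASS_COUNT, EMB_DIM = 16, 167, 256  # illustrative sizes, not the paper's values

class PoseClassConditioning(nn.Module):
    """Embed camera parameters, scene class, and timestep for per-block conditioning."""
    def __init__(self):
        super().__init__()
        self.pose_proj = nn.Linear(POSE_DIM, EMB_DIM)        # linear camera-parameter embedding
        self.class_emb = nn.Embedding(CLASS_COUNT, EMB_DIM)  # scene-class embedding
        self.time_proj = nn.Linear(1, EMB_DIM)               # diffusion-timestep embedding

    def forward(self, cam_params, class_id, t):
        cond = torch.cat([self.pose_proj(cam_params),
                          self.class_emb(class_id),
                          self.time_proj(t.float().unsqueeze(-1))], dim=-1)
        return cond  # concatenated conditioning vector, fed to each ResNet block

class RayCrossAttention(nn.Module):
    """Queries from target-view ray embeddings + features; keys/values from the reference view."""
    def __init__(self, feat_dim=EMB_DIM, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)

    def forward(self, tgt_rays, tgt_feats, ref_rays, ref_feats):
        # Ray embeddings are assumed already projected to feat_dim and fused additively.
        q = tgt_rays + tgt_feats      # target ray embedding combined with current feature map
        kv = ref_rays + ref_feats     # reference ray statistics combined with latent features
        out, _ = self.attn(q, kv, kv) # softmax(Q K^T / sqrt(d_k)) V
        return out
```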
3. Fusion Strategies for Texture and Detail Preservation
Due to the spectral bias of diffusion models—where low-frequency content dominates—the direct translation by TUNet may result in blurry reconstructions. To address this, two fusion strategies are proposed:
Strategy A: Variance Fusion via the $\epsilon$-Component
- The noise (variance) component $\epsilon^{\text{ref}}$ recovered from the reference latent is added back onto the predicted mean latent:

$$z_{t^*}^{\text{fused}} = \hat{\mu}_{t^*}^{\text{tgt}} + \sqrt{1-\bar{\alpha}_{t^*}}\,\epsilon^{\text{ref}}$$

This operation reintroduces high-frequency details crucial for texture fidelity.
Strategy B: Direct Noise Addition
- The entire DDIM-inverted reference latent $z_{t^*}^{\text{ref}}$, after scaling by a factor $\lambda$, is summed with the predicted mean:

$$z_{t^*}^{\text{fused}} = \hat{\mu}_{t^*}^{\text{tgt}} + \lambda\, z_{t^*}^{\text{ref}}$$

Both strategies exploit the inherent noise-correlation structure of DDIM inversion, acting as a mechanism for reconstructing fine textures and details otherwise omitted in mean-only translation. The fused latent is then passed as the initial condition to DDIM sampling for image synthesis; both rules are sketched in code below.
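A compact sketch of the two fusion rules, following the equations above; the $\sqrt{1-\bar{\alpha}_{t^*}}$ weighting and the scale factor `lam` are assumptions consistent with the notation used here, not values reported by the paper.

```python
import torch

def fuse_variance(mu_tgt, eps_ref, alpha_bar_t):
    """Strategy A: re-inject the reference noise component onto the predicted mean latent."""
    return mu_tgt + (1.0 - alpha_bar_t).sqrt() * eps_ref

def fuse_direct(mu_tgt, z_ref_t, lam=0.5):
    """Strategy B: add the (scaled) DDIM-inverted reference latent directly."""
    return mu_tgt + lam * z_ref_t
```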
4. Generative Prior and DDIM Sampling
TUNet leverages the pretrained latent diffusion model (LDM), where the fused latent $z_{t^*}^{\text{fused}}$ is used as the initialization of the generative process. The subsequent DDIM sampling step (deterministic, $\eta = 0$) is

$$z_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\left(\frac{z_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(z_t, t)}{\sqrt{\bar{\alpha}_t}}\right) + \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_\theta(z_t, t)$$

Here $\epsilon_\theta$, the predicted noise, is computed with the Stable Diffusion U-Net. This approach enables the generative backbone to impose photorealism and geometric consistency, mitigating potential degradations caused by aggressive latent manipulation.
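The deterministic DDIM update above can be written as a short routine; the call signature `unet(z_t, t, cond)` is an assumed placeholder for the pretrained Stable Diffusion denoiser, not its actual interface.

```python
import torch

@torch.no_grad()
def ddim_step(z_t, t, t_prev, alphas_cumprod, unet, cond):
    """One deterministic DDIM update (eta = 0) from timestep t to t_prev."""
    a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    eps = unet(z_t, t, cond)                                  # predicted noise eps_theta
    z0_pred = (z_t - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()   # predicted clean latent
    return a_prev.sqrt() * z0_pred + (1.0 - a_prev).sqrt() * eps
```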
5. Experimental Validation and Quantitative Benchmarks
Empirical evaluation on MVImgNet (3-category and 167-category settings) demonstrates the efficacy of TUNet in preserving perceptual quality and generalizing across classes and scenes (Singh et al., 14 Aug 2025). Main findings include:
- LPIPS reduced to 0.409 for the 167-class model, surpassing previous methods such as GIBR and NViST.
- FID (Fréchet Inception Distance) improved from 91.63 to 65.50, indicating better alignment with the training distribution.
- PSNR and SSIM were higher compared to baselines, consistently across seen and unseen classes.
- The fusion strategy was shown to be critical for fine-grained detail retention; omission of cross-attention or camera/class conditioning led to measurable declines in quality.
Notably, the approach is applicable to previously unseen categories and out-of-domain data, supporting robust generalization.
6. Theoretical and Practical Implications
TUNet’s design (latent translation conditioned on pose, fusion to preserve high-frequency details, and backbone-agnostic deployment) sets a new direction for efficient, controllable novel view synthesis. This methodology avoids the computational expense of retraining large diffusion models and does not require multi-view supervision. The reported results indicate:
- The explicit pose-conditioning mechanism enables accurate geometric transfer in latent space, a previously challenging aspect for generative models in single-image novel view synthesis.
- Fusion strategies leveraging noise correlations provide a means to circumvent low-pass filtering inherent in diffusion models, thus maintaining image fidelity.
- The approach is extensible to multi-view datasets, unpaired data scenarios, and can be adopted for related tasks such as robust camera localization and scene editing via pose-aware translation.
A plausible implication is that similar fusion and conditioning principles may be integrated into future generative frameworks—including latent-based video synthesis and cross-modal translation—where pose control and texture preservation are required.
7. Contextualization within the Latent Translation Literature
TUNet operationalizes recent advances in camera-conditioned latent translation, integrating lessons from works on camera-aware GANs (Liu et al., 2020), 3D-consistent unsupervised view synthesis (Ramirez et al., 2021), and latent code manipulations for video generation (Zhou et al., 8 Dec 2024). Its focus on architecture-level pose control, latent fusion, and leveraging generative priors positions it as a representative solution to efficient and accurate geometry-aware scene reconstruction and view synthesis.
The design principles—pose-dependent embeddings, cross-attention feature transfer, and fusion via noise correlation—share methodological similarities with projective positional encoding (PRoPE) (Li et al., 14 Jul 2025), suggesting substantial scope for future convergence of transformer-based camera-conditioned models and pose-aware U-Nets.
In summary, Camera Pose-Conditioned Latent Translation (TUNet) represents a rigorously validated framework for novel view synthesis in latent space, delivering high-fidelity, geometrically accurate reconstructions from a single view by exploiting pose conditioning, fusion of latent statistics, and the generative priors of pretrained diffusion models (Singh et al., 14 Aug 2025).