
Pose-guided Multi-view Multimodal Diffusion

Updated 24 December 2025
  • The paper introduces a PMMD model that leverages explicit pose guidance, multi-view fusion, and multimodal encoders to generate high-fidelity, structurally consistent images.
  • It details an architecture that combines diffusion-based denoising with specialized modules such as Residual Cross-View Attention and classifier-free guidance, achieving improved metrics (FID 8.56, LPIPS 0.1909, SSIM 0.7397).
  • The framework integrates geometry-aware volumetric rendering and robust multi-view conditioning to support applications in digital human creation, virtual try-on, and geospatial analysis.

Pose-guided Multi-view Multimodal Diffusion (PMMD) refers to a class of diffusion models that generate high-fidelity, identity-consistent images (primarily of people, but extensible to objects and scenes) conditioned on multiple reference views, pose representations, and multimodal signals such as text, masks, or sensor-specific modalities. PMMD leverages explicit pose guidance, multi-view input fusion, and sophisticated multimodal encoders to address occlusion, texture completion, and cross-view consistency, outperforming traditional single-view or unimodal diffusion-based synthesizers in detailed structure preservation, control, and perceptual quality (Shang et al., 17 Dec 2025, Xie et al., 19 Nov 2025, Berian et al., 16 Jan 2025, Tang et al., 2023, Cheong et al., 2023).

1. Core Principles of Pose-guided Multi-view Multimodal Diffusion

PMMD frameworks are characterized by the following design components:

  • Multi-view Appearance Conditioning: Inputs are sets of images from diverse viewpoints, each associated with precise pose metadata or structured geometry (e.g., camera matrices, SMPL body parameters, DensePose maps).
  • Multimodal Fusion: Conditioning is realized by jointly modeling image, pose, text, and, in geospatial/scientific domains, sensor modality (e.g., EO, SAR, LiDAR) using learned encoders and shared latent spaces.
  • Diffusion Backbone: Generation is performed within a Denoising Diffusion Probabilistic Model (DDPM) or related stochastic process, nearly always instantiated as a UNet operating in a VAE-latent space for computational efficiency and resolution scalability.
  • Explicit Pose Guidance: Fine-tuned branches (often using ControlNet-like mechanisms) inject pose features at all stages of the UNet, ensuring tight adherence to target pose or camera-view during synthesis.
  • Cross-View and Cross-Modal Attention: Specialized modules such as Residual Cross-View Attention (ResCVA) or Correspondence-Aware Attention (CAA) are used to exchange information across different views or modalities, enabling uniformity and consistency.

The overarching principle is that pose, view, and semantic content are jointly fused to generate outputs that are photorealistic, pose-accurate, semantically controlled, and consistent across reference and target views (Shang et al., 17 Dec 2025, Berian et al., 16 Jan 2025, Xie et al., 19 Nov 2025, Tang et al., 2023).
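
To make the interplay of these components concrete, the following minimal PyTorch sketch shows one way pose maps, masked multi-view reference latents, and text tokens can be fused in a single denoising call. All module choices, tensor shapes, and names (e.g., the hypothetical TinyPMMDDenoiser) are illustrative assumptions rather than the architecture of any cited paper; timestep embedding and the full UNet are omitted for brevity.

```python
# Illustrative sketch only: a toy denoiser that ingests pose, multi-view, and
# text conditioning in one forward pass (assumed names/shapes, not a paper's design).
import torch
import torch.nn as nn

class TinyPMMDDenoiser(nn.Module):
    def __init__(self, latent_ch=4, pose_ch=3, ctx_dim=64):
        super().__init__()
        self.pose_branch = nn.Conv2d(pose_ch, latent_ch, 3, padding=1)   # pose guidance branch
        self.view_proj = nn.Linear(latent_ch, ctx_dim)                   # reference-view tokens
        self.text_proj = nn.Linear(ctx_dim, ctx_dim)                     # text/semantic tokens
        self.attn = nn.MultiheadAttention(ctx_dim, num_heads=4, batch_first=True)
        self.to_feat = nn.Conv2d(latent_ch, ctx_dim, 1)
        self.to_eps = nn.Conv2d(ctx_dim, latent_ch, 1)                   # noise-prediction head

    def forward(self, z_t, pose_map, ref_views, view_mask, text_tokens):
        # z_t: (B,C,H,W) noisy target latent; pose_map: (B,P,H,W);
        # ref_views: (B,V,C,H,W); view_mask: (B,V); text_tokens: (B,L,ctx_dim).
        # Timestep conditioning is omitted here for brevity.
        b, v, c, h, w = ref_views.shape
        z = z_t + self.pose_branch(pose_map)                    # inject pose features
        view_tok = self.view_proj(ref_views.mean(dim=(3, 4)))   # (B,V,ctx_dim) pooled view tokens
        view_tok = view_tok * view_mask.unsqueeze(-1)           # zero out masked/missing views
        context = torch.cat([self.text_proj(text_tokens), view_tok], dim=1)
        q = self.to_feat(z).flatten(2).transpose(1, 2)          # (B,HW,ctx_dim) spatial queries
        fused, _ = self.attn(q, context, context)               # cross-attend over all modalities
        fused = fused.transpose(1, 2).reshape(b, -1, h, w)
        return self.to_eps(fused)                               # predicted noise

# Shape check only; weights are random, the output is meaningless but runs end to end.
model = TinyPMMDDenoiser()
eps = model(torch.randn(2, 4, 16, 16), torch.randn(2, 3, 16, 16),
            torch.randn(2, 3, 4, 16, 16), torch.ones(2, 3), torch.randn(2, 8, 64))
print(eps.shape)  # torch.Size([2, 4, 16, 16])
```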

2. Model Architectures and Information Fusion

PMMD implementations unify multiple architectural innovations to address the multimodal, multi-view conditions:

  • Multimodal Encoders: Visual inputs (stacked multi-view images), pose maps (DensePose, SMPL, or keypoints), and language/text are separately encoded using VAEs, ControlNets, and CLIP-family encoders, then fused by lightweight adapters or token concatenation (Shang et al., 17 Dec 2025, Cheong et al., 2023).
  • Masked Multi-View Training: During optimization, random masking of input views enforces robustness to missing references and enables interpolation between single- and multi-view cases.
  • Prior Modules: Appearance Prior Modules (APMs; e.g., in JCDM (Xie et al., 19 Nov 2025)) are trained to predict a global semantic embedding (e.g., in CLIP space) summarizing identity/features across available views.
  • Joint Conditional Injection: The UNet receives a composite latent vector combining noisy target images, view masks, appearance priors, and pose features; cross-attention layers at every UNet depth enable concurrent reasoning over all modalities and views.
  • Residual Cross-View Attention (ResCVA): UNet feature maps are divided spatially, allowing local self-attention to capture view-to-view correspondences and refine local detail while maintaining global structure (Shang et al., 17 Dec 2025).
  • Classifier-Free Guidance: PMMD universally employs classifier-free guidance: conditioning inputs are randomly dropped during training, and conditional and unconditional predictions are mixed during denoising at inference, increasing perceptual controllability (Shang et al., 17 Dec 2025, Cheong et al., 2023).

These mechanisms ensure that the UNet can ingest and utilize complex, complementary information about pose, appearance, semantics, and sensor modality at every stage of the denoising process.
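
As an illustration of the cross-view exchange described above, the sketch below implements a residual cross-view attention block in the spirit of ResCVA: per-view feature maps are partitioned into spatial windows, tokens from all views at the same window location attend to one another, and the result is added back residually. The window size, head count, and normalization are assumptions for illustration; the published layer may differ.

```python
# Sketch of a residual cross-view attention block (assumed design details).
import torch
import torch.nn as nn

class ResidualCrossViewAttention(nn.Module):
    def __init__(self, dim, window=4, heads=4):
        super().__init__()
        self.window = window
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):
        # feats: (B, V, C, H, W) per-view UNet feature maps; H and W divisible by window.
        b, v, c, h, w = feats.shape
        k = self.window
        # Partition each view into k*k spatial windows and group tokens that share a window.
        x = feats.reshape(b, v, c, h // k, k, w // k, k)
        x = x.permute(0, 3, 5, 1, 4, 6, 2)            # (B, H/k, W/k, V, k, k, C)
        x = x.reshape(-1, v * k * k, c)               # tokens from all views at one window
        y = self.norm(x)
        y, _ = self.attn(y, y, y)                     # cross-view self-attention
        x = x + y                                     # residual connection
        # Undo the partitioning back to (B, V, C, H, W).
        x = x.reshape(b, h // k, w // k, v, k, k, c)
        x = x.permute(0, 3, 6, 1, 4, 2, 5)
        return x.reshape(b, v, c, h, w)

# Example: 2 samples, 3 views, 32-channel 16x16 feature maps.
block = ResidualCrossViewAttention(dim=32)
out = block(torch.randn(2, 3, 32, 16, 16))
print(out.shape)  # torch.Size([2, 3, 32, 16, 16])
```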

3. Geometry-Aware and Volumetric Representations

PMMD, especially in the context of scene or object synthesis (e.g., CrossModalityDiffusion (Berian et al., 16 Jan 2025)), incorporates volumetric and geometry-aware pipelines:

  • Feature Volume Construction: For each modality and reference view, modality-specific encoders map images and known camera poses into 3D feature volumes residing in the camera frustum (Berian et al., 16 Jan 2025).
  • World-to-View Transformation: Points are transformed into each view’s local volume coordinate system, and per-view features are sampled via trilinear interpolation. The features are fused (often by mean-pooling) into a scene-centric latent field.
  • Volumetric Rendering: For novel target poses, rays are cast, and features are sampled along each ray; these are passed through an MLP to estimate feature and density, and rendered into a conditioning feature image via discretized volume integration.
  • Diffusion Model Conditioning: The rendered feature image is used as input to a modality-specific diffusion denoiser, enabling cross-modal and cross-view synthesis even in the absence of explicit scene geometry.

This approach enables robust pose-guided synthesis across modalities, directly incorporating geometric structure and maintaining consistency in the output images under viewpoint change (Berian et al., 16 Jan 2025).
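
The following sketch illustrates the rendering step of such a pipeline: given an already-fused, scene-centric feature volume, rays are cast for a target pose, features are trilinearly sampled along each ray, a small MLP predicts a per-point feature and density, and discretized volume integration produces per-ray conditioning features. The helper render_feature_image, its tiny MLP, the grid layout, and the compositing details are assumptions made for illustration, not the exact CrossModalityDiffusion implementation.

```python
# Minimal sketch of rendering a conditioning feature image from a fused 3D feature volume.
import torch
import torch.nn as nn
import torch.nn.functional as F

def render_feature_image(volume, rays_o, rays_d, mlp, n_samples=32, near=0.5, far=2.5):
    """
    volume : (1, C, D, H, W) scene-centric feature volume in [-1, 1]^3 coordinates
    rays_o : (N, 3) ray origins; rays_d: (N, 3) unit ray directions
    mlp    : maps per-point features (C,) -> (feat_dim + 1,) = (feature, density)
    returns: (N, feat_dim) rendered per-ray conditioning features
    """
    n = rays_o.shape[0]
    t = torch.linspace(near, far, n_samples)                            # sample depths
    pts = rays_o[:, None, :] + t[None, :, None] * rays_d[:, None, :]    # (N, S, 3) points
    # Trilinear sampling of the feature volume at the 3D sample points.
    grid = pts.view(1, n, n_samples, 1, 3)                              # grid in xyz order
    feats = F.grid_sample(volume, grid, align_corners=True)             # (1, C, N, S, 1)
    feats = feats[0, :, :, :, 0].permute(1, 2, 0)                       # (N, S, C)
    out = mlp(feats)                                                    # (N, S, feat_dim + 1)
    sigma = F.softplus(out[..., -1])                                    # non-negative density
    f = out[..., :-1]
    delta = (far - near) / n_samples
    alpha = 1.0 - torch.exp(-sigma * delta)                             # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(n, 1), 1 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans                                             # volume-rendering weights
    return (weights[..., None] * f).sum(dim=1)                          # (N, feat_dim)

# Example with random inputs: an 8-channel 16^3 volume and 4 rays from the origin.
mlp = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 16 + 1))
features = render_feature_image(torch.randn(1, 8, 16, 16, 16),
                                torch.zeros(4, 3), F.normalize(torch.randn(4, 3), dim=-1), mlp)
print(features.shape)  # torch.Size([4, 16])
```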

4. Training Protocols, Objectives, and Inference

PMMD training and inference routines exhibit several shared characteristics:

  • Losses: The predominant objective is an MSE noise-prediction loss between the predicted and true noise, optionally with spatial weighting masks that emphasize critical regions (face, limbs) (Shang et al., 17 Dec 2025, Cheong et al., 2023).
  • Joint Training: All encoders (for different modalities) and diffusion model weights are updated jointly, with losses summed over random source/target combinations to enforce cross-modal and cross-view alignment (Berian et al., 16 Jan 2025, Xie et al., 19 Nov 2025).
  • Guidance and Conditional Dropout: A guidance parameter (e.g., ω=0.7) balances the fully conditioned prediction against a partially unconditioned one; during training, inputs such as views or text are dropped with fixed probability to enable classifier-free guidance (Shang et al., 17 Dec 2025).
  • Inference: At each denoising step, multimodal features are recomputed, and conditional and unconditional predictions are mixed according to the classifier-free guidance weight. For multi-view output, joint latent maps are processed in parallel, with cross-branch attention ensuring consistency (as in MVDiffusion (Tang et al., 2023)).

No additional adversarial or perceptual losses are introduced unless specifically noted; all identity, viewpoint, and appearance consistency is enforced by the architecture and the noise-prediction loss.
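
A compact sketch of this recipe is given below: a spatially weighted noise-prediction MSE with random conditioning dropout during training, and classifier-free mixing of conditional and unconditional predictions at inference. The function names (pmmd_training_loss, cfg_noise), the dropout probability, the use of zeroed tensors as the null conditioning, and the convex-combination form of the guidance are assumptions; only the overall structure follows the cited papers.

```python
# Hedged sketch of a PMMD-style training loss and classifier-free guidance.
import torch

def pmmd_training_loss(model, z0, cond, weight_map, alphas_cumprod, p_drop=0.1):
    """z0: (B,C,H,W) clean latents; cond: dict of conditioning tensors;
    weight_map: (B,1,H,W) emphasis on critical regions (e.g. face, limbs)."""
    b = z0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,))            # random timesteps
    a = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(z0)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * noise                   # forward diffusion
    if torch.rand(()) < p_drop:                                    # conditional dropout
        cond = {k: torch.zeros_like(v) for k, v in cond.items()}   # assumed "null" conditioning
    eps_hat = model(z_t, t, **cond)                                # predicted noise
    return (weight_map * (eps_hat - noise) ** 2).mean()            # spatially weighted MSE

@torch.no_grad()
def cfg_noise(model, z_t, t, cond, null_cond, omega=0.7):
    """Classifier-free guidance: mix unconditional and conditional noise predictions."""
    eps_uncond = model(z_t, t, **null_cond)
    eps_cond = model(z_t, t, **cond)
    return eps_uncond + omega * (eps_cond - eps_uncond)            # == (1-ω)·uncond + ω·cond

# Toy usage with a dummy "model" that ignores its conditioning.
dummy = lambda z, t, **cond: torch.zeros_like(z)
loss = pmmd_training_loss(dummy, torch.randn(2, 4, 8, 8),
                          {"pose": torch.randn(2, 3, 8, 8)},
                          torch.ones(2, 1, 8, 8), torch.linspace(0.99, 0.01, 1000))
print(loss.item())
```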

5. Quantitative and Qualitative Evaluation

Comprehensive benchmarks demonstrate that PMMD architectures achieve superior performance compared to prior single-view or unimodal conditioning methods:

Method        SSIM ↑   LPIPS ↓   FID ↓
T2I-Adapter   0.4755   0.4107    20.573
ControlNet    0.5263   0.3607    17.872
IP-Adapter    0.6186   0.3004    13.032
UPGPT         0.7085   0.2309    10.387
PMMD          0.7397   0.1909    8.5638

Multi-view user studies (real-to-generated, generated-to-real, subjective preference) consistently show PMMD preferred for realism and view consistency (Shang et al., 17 Dec 2025, Xie et al., 19 Nov 2025). Ablation studies reveal substantial contributions from joint input encoding, cross-view attention, appearance prior modules, and view masking. Increasing reference view count yields commensurate gains in FID and LPIPS, reflecting improved coverage of occluded or ambiguous regions (Shang et al., 17 Dec 2025, Xie et al., 19 Nov 2025).

On geospatial datasets (e.g., ShapeNet-Cars with EO, SAR, and LiDAR), cross-modality PMMD analogues achieve PSNR near 19.7 and SSIM near 0.87 (EO→EO), with scores remaining competitive for SAR→EO synthesis (Berian et al., 16 Jan 2025). Multi-view and multi-modal fusion reliably outperforms single-modality baselines.
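
For reference, the sketch below shows how such metrics are commonly computed with open-source tools (scikit-image for SSIM, the lpips package for LPIPS, torchmetrics for FID); the exact preprocessing, resolutions, and library versions used in the cited papers are not specified and may differ. The tiny random batches here only exercise the APIs; real evaluations use thousands of images, and torchmetrics' FID additionally requires the torch-fidelity backend.

```python
# Hedged example of computing SSIM, LPIPS, and FID with common libraries.
import torch
import numpy as np
from skimage.metrics import structural_similarity
import lpips
from torchmetrics.image.fid import FrechetInceptionDistance

real_u8 = torch.randint(0, 256, (8, 3, 64, 64), dtype=torch.uint8)   # stand-in "real" images
fake_u8 = torch.randint(0, 256, (8, 3, 64, 64), dtype=torch.uint8)   # stand-in "generated" images

# SSIM on HWC uint8 arrays (higher is better).
ssim = np.mean([structural_similarity(r.permute(1, 2, 0).numpy(), f.permute(1, 2, 0).numpy(),
                                       channel_axis=-1, data_range=255)
                for r, f in zip(real_u8, fake_u8)])

# LPIPS expects float tensors in [-1, 1] (lower is better).
lpips_fn = lpips.LPIPS(net='alex')
lp = lpips_fn(real_u8.float() / 127.5 - 1, fake_u8.float() / 127.5 - 1).mean()

# FID over uint8 image batches (lower is better); meaningful only with many images.
fid = FrechetInceptionDistance(feature=2048)
fid.update(real_u8, real=True)
fid.update(fake_u8, real=False)
print(ssim, lp.item(), fid.compute().item())
```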

6. Applications, Limitations, and Future Directions

PMMD is principally deployed for:

  • Virtual Try-On: Synthesis of humans in arbitrary poses and garments, crucial for e-commerce and AR.
  • Digital Human Creation: Generating multi-view consistent, photorealistic avatars with identity and appearance preservation.
  • Geospatial Analysis & Multimodal Fusion: Cross-modality view generation (EO/SAR/LiDAR).
  • General Multi-View Synthesis: Consistent rendering of arbitrary objects or scenes from limited views.

Limitations include reduced robustness under extreme pose changes, difficulty handling accessories and non-canonical shapes, and the need for finer-grained garment/semantic segmentation to obtain sharper boundaries. Planned directions include integrating garment segmentation, adding explicit temporal-consistency modules for video generation, and extending to text-image hybrid conditioning scenarios (Shang et al., 17 Dec 2025, Xie et al., 19 Nov 2025, Berian et al., 16 Jan 2025).

This suggests that as the field progresses, improvements in multimodal encoders, attention architectures, and conditional fusion will drive further gains in controllability, detail preservation, and flexibility.

7. Relation to Other Multi-view Diffusion Pipelines

PMMD can be situated relative to a spectrum of diffusion-based multi-view synthesis pipelines:

  • MVDiffusion (Tang et al., 2023): Simultaneous multi-view generation using shared UNet branches and CAA, suited to settings where pixel-wise correspondences are known; it avoids the error accumulation of sequential generation and achieves globally consistent output via weight sharing and cross-branch attention.
  • CrossModalityDiffusion (Berian et al., 16 Jan 2025): Generalizes PMMD to arbitrary sensor modalities, utilizing geometry-aware volumetric rendering to unify disparate input types.
  • UPGPT (Cheong et al., 2023): Early unified diffusion model combining text, pose, and visual prompt conditioning for mask-less person image editing and pose transfer, with low-dimensional 3D pose and camera code for true view interpolation.

A plausible implication is that PMMD-style fusion—involving explicit, learned pose embedding, cross-view/self-attention, and multimodal adapter tokens—has become the new baseline for controllable, consistent multi-view person and scene synthesis, yielding state-of-the-art quantitative and perceptual results across a range of application domains.
