Principal Pose Guidance in Diffusion Models
- Principal Pose Guidance is a training-free, step-wise pose-conditioned control strategy that ensures human body structure is maintained during diffusion-based virtual try-on.
- It constructs a pose-preserving proxy image and latent representation that removes garment textures while supplying coarse cues for a new garment.
- Empirical evaluations show that aligning top principal components of the latent state improves structural fidelity and image coherence over prior methods.
Principal Pose Guidance denotes a pose-conditioned control strategy in which pose functions as the structurally dominant guidance signal. In the most explicit formulation bearing this name, OmniVTON++ defines Principal Pose Guidance (PPG) as a training-free, step-wise pose guidance mechanism operating inside the diffusion sampling loop: it constructs a pose-preserving but garment-agnostic proxy image, encodes that proxy into a latent, and, at each diffusion step, selects codebook noise so that the denoising trajectory aligns the principal components of the current prediction with the proxy latent, thereby enforcing pose while leaving appearance degrees of freedom for the target garment (Yang et al., 16 Feb 2026). In related work, the phrase also appears in a broader sense to describe systems in which pose is treated as the primary structural or supervisory signal, rather than merely an auxiliary condition (Trinh et al., 2023).
1. Definition and design rationale
In OmniVTON++, PPG is introduced to address a specific failure mode of training-free virtual try-on: the need to preserve the human body structure and pose of the input person image while replacing the garment appearance with a new garment. If the original person image is used naively as a pose source, for example through DDIM inversion or direct latent replacement, the diffusion model tends to carry over the old clothing appearance together with the pose. This is particularly problematic in a training-free setting, because no retraining is available to disentangle pose from garment texture (Yang et al., 16 Feb 2026).
The mechanism is motivated against two immediate alternatives. First, the previous OmniVTON system used Spectral Pose Injection (SPI), which injects pose only at initialization by mixing a pose-aware inverted latent with random noise. That approach preserves pose only to some extent, but once sampling begins there is no persistent structural control, and pose drifts or local limb misalignments can appear. Second, prior pose-conditioned diffusion systems such as ControlNet require trained control branches, whereas PPG is explicitly described as plug-and-play and training-free, operating through diffusion states and external analysis signals rather than retraining (Yang et al., 16 Feb 2026).
The distinctive claim of PPG is therefore not merely that it uses pose, but that it applies pose regulation at every sampling step, and does so through a proxy latent that contains pose and coarse garment-region information while excluding the original garment’s fine appearance. The additional restriction to the principal structural components of the latent is intended to prevent over-constraining garment appearance (Yang et al., 16 Feb 2026).
2. Proxy latent construction and formal mechanism
PPG does not inject 2D keypoints or DensePose maps directly into the diffusion model. Instead, pose is represented implicitly in a constructed proxy image that preserves the person’s body pose and layout, removes the original garment texture, and inserts a coarse target-garment region cue. The diffusion model’s VAE then maps this image to a proxy latent.
This makes pose a latent-space structural signal rather than a direct geometric input (Yang et al., 16 Feb 2026).
The proxy image is built by overwriting the original person image in a fixed order of regions. First, in the garment regions, the original garment is removed by inpainting. Second, a body region is filled with a constant skin color. Third, the target-garment region is filled with a constant target color. All remaining pixels are preserved. The result is a piecewise constant composite image that preserves pose and silhouette, removes original garment structure and fine texture, and marks where the new garment should appear (Yang et al., 16 Feb 2026).
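The fixed overwrite order described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function name, the argument names, and the assumption that all region masks and the inpainted image are already available as arrays are hypothetical and chosen only for illustration.

```python
import numpy as np

def build_proxy(person, inpainted, garment_mask, body_mask, target_mask,
                skin_color, target_color):
    """Compose a proxy image by overwriting regions in a fixed order.

    Hypothetical interface (assumed, not from the paper):
      person, inpainted : (H, W, 3) arrays; `inpainted` is the person image
                          with the original garment already inpainted away.
      *_mask            : (H, W) boolean arrays.
      skin_color, target_color : length-3 constant colors.
    """
    proxy = person.copy()
    # 1. Garment regions: take the inpainted pixels (original garment removed).
    proxy[garment_mask] = inpainted[garment_mask]
    # 2. Body region: fill with a constant skin color.
    proxy[body_mask] = skin_color
    # 3. Target-garment region: fill with a constant target color.
    proxy[target_mask] = target_color
    # Pixels outside all three masks keep their original values.
    return proxy
```

Because later assignments overwrite earlier ones, the order of the three steps decides which fill wins wherever masks overlap.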
PPG is built on the DDCM sampling idea. In the standard codebook formulation, the stochastic noise term of each reverse diffusion update is not drawn freshly from a Gaussian but selected from a timestep-specific codebook of fixed noise vectors. DDCM selects a codebook entry by aligning it with the residual between a target latent and the current prediction. In virtual try-on, the true target latent is unknown, so PPG replaces the target latent with the proxy latent, but only after restricting the current prediction to its top principal components. Pose-guided noise selection then picks the codebook entry best aligned with the residual between the proxy latent and this principal-component restriction of the prediction. The selected codebook element replaces the raw stochastic noise term in the diffusion update (Yang et al., 16 Feb 2026).
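The selection rule can be written compactly in assumed notation. None of the symbols below are taken from the paper, and its exact formulation may differ: $z_t$ is the latent at step $t$, $\mu_t$ the sampler's deterministic mean, $\sigma_t$ the noise scale, $\mathcal{C}_t = \{c_t^{(1)}, \dots, c_t^{(K)}\}$ the timestep-specific codebook, $\hat{z}_{0|t}$ the current clean-latent prediction, $z^{\star}$ a known target latent, $z_p$ the proxy latent, and $\Pi_r$ the restriction to the top $r$ principal components.

```latex
\begin{aligned}
z_{t-1} &= \mu_t(z_t) + \sigma_t\, c_t^{(k_t)}, \qquad c_t^{(k)} \in \mathcal{C}_t, \\
k_t^{\mathrm{DDCM}} &= \arg\max_{k}\ \big\langle c_t^{(k)},\ z^{\star} - \hat{z}_{0|t} \big\rangle, \\
k_t^{\mathrm{PPG}} &= \arg\max_{k}\ \big\langle c_t^{(k)},\ z_p - \Pi_r\big(\hat{z}_{0|t}\big) \big\rangle .
\end{aligned}
```

Read this way, the only difference between the two rules is the residual being matched: a known target in DDCM, and the proxy latent against a structure-only view of the prediction in PPG.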
The formal interpretation given in the paper is that the leading principal components of the latent prediction correspond to global structural (pose/shape) information, whereas the remaining components carry fine-grained appearance details. This suggests that principal-component restriction is intended to preserve structural regulation without forcing the generated garment to resemble the proxy’s deliberately texture-free appearance (Yang et al., 16 Feb 2026).
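One way to make the principal-component restriction concrete is the NumPy sketch below. It rests on an assumption the source does not confirm: the latent is treated as a data matrix whose rows are spatial positions and whose columns are channels, and PCA is computed over that matrix. The function name and the default `r=3` are illustrative.

```python
import numpy as np

def top_principal_restriction(latent, r=3):
    """Keep only the top-r principal components of a (C, H, W) latent.

    Assumption (not from the paper): spatial positions are the samples
    and channels are the features of the PCA.
    """
    c, h, w = latent.shape
    x = latent.reshape(c, h * w).T              # (H*W, C) data matrix
    mean = x.mean(axis=0, keepdims=True)
    centred = x - mean
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    basis = vt[:r]                              # (r, C)
    # Project onto the leading directions and map back to channel space.
    restricted = centred @ basis.T @ basis + mean
    return restricted.T.reshape(c, h, w)
```

With `r` equal to the number of channels the function returns the latent unchanged (up to floating-point error); smaller `r` discards the low-variance directions that, under the paper's interpretation, carry fine appearance detail.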
3. Placement inside OmniVTON++ and implementation
OmniVTON++ organizes virtual try-on into three major components: Structured Garment Morphing (SGM), Principal Pose Guidance (PPG), and Continuous Boundary Stitching (CBS / CBS-DiT). SGM builds a geometry-aligned coarse garment prior through part-wise homography. PPG operates in the second stage, garment-infused image inpainting, to keep body structure consistent with the person pose during diffusion sampling. CBS or CBS-DiT then refines boundaries between morphed garment parts and surrounding regions (Yang et al., 16 Feb 2026).
In the inpainting stage, the diffusion model is conditioned on three inputs: the person image with the morphed garment prior injected into the masked region, the cloth-agnostic mask, and the text prompt. Sampling begins from random noise. At each timestep, the model produces its denoising prediction, PPG performs pose-guided noise selection using that prediction and the proxy latent, and the latent is updated by replacing the stochastic noise term with the selected codebook vector. PPG therefore continuously regulates pose, while SGM supplies garment geometry and CBS regulates boundary coherence (Yang et al., 16 Feb 2026).
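The per-step workflow can be sketched as a short NumPy loop. This is an illustrative reading, not the paper's algorithm: `predict` is a hypothetical callable standing in for the conditioned diffusion model plus solver (assumed to return the deterministic mean of the update and the current clean-latent estimate), the PCA layout over spatial positions and channels is an assumption, and `codebooks` and `sigmas` are placeholder inputs.

```python
import numpy as np

def restrict_top_pcs(z, r):
    """Top-r PCA restriction of a (C, H, W) latent (assumed layout:
    spatial positions as samples, channels as features)."""
    c, h, w = z.shape
    x = z.reshape(c, h * w).T
    mean = x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    basis = vt[:r]
    return ((x - mean) @ basis.T @ basis + mean).T.reshape(c, h, w)

def select_noise(codebook, z0_hat, z_proxy, r=3):
    """Pick the codebook entry best aligned with the pose residual."""
    residual = (z_proxy - restrict_top_pcs(z0_hat, r)).ravel()
    scores = codebook.reshape(len(codebook), -1) @ residual
    return codebook[int(np.argmax(scores))]

def sample(predict, z_proxy, codebooks, sigmas, rng, r=3):
    """Reverse loop with the stochastic term replaced by selected noise.

    predict(z, t) -> (mean, z0_hat) is a hypothetical stand-in for the
    conditioned model and solver; codebooks[t] has shape (K, C, H, W).
    """
    z = rng.standard_normal(z_proxy.shape)       # start from random noise
    for t in reversed(range(len(sigmas))):
        mean, z0_hat = predict(z, t)             # model prediction at step t
        noise = select_noise(codebooks[t], z0_hat, z_proxy, r)
        z = mean + sigmas[t] * noise             # solver update, noise swapped
    return z
```

Only the noise term changes; the mean returned by the solver is used as is, which matches the description of PPG as leaving the underlying sampler untouched.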
The method is implemented on two diffusion backbones: Stable Diffusion v2.0 (SD-2.0), a U-Net latent diffusion model, and FLUX.1 Fill, a DiT inpainting backbone. The paper states that PPG is applied identically in concept on both because it operates at the latent sampling level and does not require architecture changes. It does not modify network weights or define a new training loss; it changes how noise is chosen at each step. The implementation uses off-the-shelf OpenPose, DensePose, human parsing, and an inpainting operator for background removal (Yang et al., 16 Feb 2026).
Reported hyperparameters are specific. A fixed-size codebook is used at each timestep, and 3 principal components are retained per timestep. Sampling uses a DDIM sampler with 50 steps for SD-2.0 and an SDE variant of DPM-Solver++ with 30 steps for FLUX. The paper emphasizes that PPG is implemented by replacing the stochastic noise term in the sampler with pose-guided noise while keeping the underlying solver unchanged (Yang et al., 16 Feb 2026).
4. Empirical behavior and ablation evidence
The ablations in OmniVTON++ attribute a distinct structural role to PPG. In the macro ablation, adding only PPG to the base system improves structural and perceptual metrics on both backbones: on SD-2.0 and FLUX alike, the PPG-only variant reports higher SSIM and lower LPIPS than the base variant. The paper interprets the SSIM increase as a better global structural match to ground truth and the LPIPS decrease as sharper, more coherent imagery; the visual examples show correction of limb misalignment and unnatural poses visible without PPG (Yang et al., 16 Feb 2026).
A more targeted comparison evaluates several pose-guidance variants on VITON-HD with the SD-2.0 backbone. The compared variants are as follows.
| Variant | Brief note |
|---|---|
| ControlNet | trained control branch |
| SPI | initialization only |
| Full-Latent | good structure, some FID degradation |
| Low-Frequency Latent | fixed low-frequency cutoff |
| PPG | principal-component guidance |
The paper’s interpretation is specific. ControlNet helps, but is limited by multi-modal conditioning conflicts and training dependencies. SPI scores well on one reported metric but offers less structural control because it acts only at initialization. Full-Latent guidance over-constrains appearance to match the proxy and hurts realism. Low-Frequency Latent guidance is sensitive to a fixed low-frequency cutoff across timesteps. PPG is presented as balancing good global structure, good perceptual quality, and reasonable FID. Figure 1 is cited as evidence that principal components capture pose structure more compactly than low-frequency components (Yang et al., 16 Feb 2026).
Within the complete OmniVTON++ system, the paper further states that combining SGM, PPG, and CBS/CBS-DiT yields state-of-the-art or second-best results across the VITON-HD, DressCode, and StreetTryOn benchmarks, and that the framework supports not only single-garment, single-human cases but also multi-garment, multi-human, and anime character virtual try-on (Yang et al., 16 Feb 2026).
5. Related formulations in pose-guided generation and representation learning
Outside virtual try-on, related work uses closely aligned ideas in different technical forms. In RePoseDM, pose guidance is split between recurrent pose alignment, which produces pose-aligned texture features used as conditional guidance, and gradient guidance from pose interaction fields, which shapes the denoising trajectory toward the valid pose manifold and away from source-pose leakage (Khandelwal, 2023). In TCAN, pose is the primary control signal for human image animation through a frozen OpenPose ControlNet, a LoRA-based Appearance–Pose Adaptation layer, Temporal ControlNet, and a Pose-driven Temperature Map that uses pose coverage to stabilize background regions over time (Kim et al., 2024). In DisPose, pose is explicitly disentangled into motion field guidance and keypoint correspondence, so that dense region-level motion and identity-related feature transfer are both derived from sparse skeleton pose without relying on external dense conditions (Li et al., 2024).
A structurally related but architecturally distinct formulation appears in ASTRA, which argues that pose should be the primary, structurally dominant guidance signal in multi-subject generation. Its RAG-Pose pipeline provides an explicit structural prior from a curated database, and Enhanced Universal Rotary Position Embedding (EURoPE) gives identity tokens layout-free positions while binding pose tokens to the canvas (Xia et al., 15 Apr 2026). A different training-time interpretation is provided by PGDS for clothes-changing person re-identification, where pose functions as the principal supervisory signal through a frozen pose teacher, a human encoder, and a Pose-to-Human Projection module that applies layer-wise guide losses while leaving inference cost unchanged (Trinh et al., 2023).
These formulations are not identical to PPG in OmniVTON++, but they indicate a broader pattern. This suggests that “principal pose guidance” can designate at least three families of methods: sampling-time structural control in diffusion, architectural disentanglement of structure and appearance, and training-time deep supervision in which pose organizes the learned feature hierarchy.
6. Limitations and recurring issues
In OmniVTON++, the main limitations of PPG arise from its dependence on human analysis signals used during proxy construction. The proxy relies on DensePose, human parsing, garment masks, and DensePose-projected garment masks. If DensePose or parsing fails, for example under extreme poses, strong occlusions, or cluttered backgrounds, the proxy image can become structurally incorrect, which can in turn produce a wrong target-garment region or distorted limbs in the final try-on. The paper also notes that accessories such as necklaces can fall into the inpainting or proxy regions and be removed because cloth-agnostic masks are coarse and skeleton-based. The stated future directions include more robust or jointly optimized human analysis modules, segment-anything-like refinement of masks, and improved proxy construction or learned proxies while keeping sampling training-free (Yang et al., 16 Feb 2026).
A broader reading of the literature suggests that this dependence on upstream pose or structure estimation is recurrent rather than specific to OmniVTON++. MimicMotion remains dependent on 2D pose quality and explicitly notes that high-confidence but wrong keypoints can still misguide generation (Zhang et al., 2024). DisPose reports limitations for unseen parts, complex backgrounds, and very extreme body-shape or clothing differences despite replacing external dense guidance with reference-based motion propagation (Li et al., 2024). PGDS likewise treats inaccurate pose as a potentially misleading teacher signal because the pose encoder is frozen and the guide loss is strong (Trinh et al., 2023).
Within this broader landscape, PPG in OmniVTON++ is notable because it narrows pose control to the principal structural components of the latent instead of enforcing full latent agreement with a proxy. The broader implication is suggested rather than stated outright: if pose is to remain the dominant structural signal without collapsing appearance diversity, the control mechanism must preserve a separation between structure-bearing directions and appearance-bearing degrees of freedom. That separation is implemented in OmniVTON++ through proxy design, PCA-restricted residual alignment, and step-wise codebook noise selection (Yang et al., 16 Feb 2026).