
Condition Prior Preservation Loss (CPPL)

Updated 24 November 2025
  • CPPL is a loss function designed for few-shot personalization, preserving both text and pose control during fine-tuning of diffusion models.
  • It augments the standard reconstruction loss by matching the fine-tuned model's outputs to those of the frozen base model under generic text and pose conditions.
  • Empirical results show CPPL reduces overfitting, enhances identity consistency, and maintains robust control across diverse poses.

Condition Prior Preservation Loss (CPPL) is a loss function introduced to enhance few-shot personalization of large text-to-image diffusion models by preserving both textual and pose control during fine-tuning. Originating in the context of pose-aware 3D avatar reconstruction from real-world images, CPPL generalizes prior-preservation principles beyond textual guidance to include explicit geometric cues, enabling personalized avatar models to maintain high-fidelity appearance and robust controllability across diverse poses and prompts (Xi et al., 17 Nov 2025).

1. Formal Definition

Within the ControlBooth stage of the PFAvatar pipeline, CPPL augments the usual diffusion-based reconstruction objective by anchoring the fine-tuned model's outputs to the behavior of the original pre-trained model under generic text and pose conditions. The total loss is

$$L_\text{total}^\text{CB} = L_\text{rec} + \lambda_\text{cppl}\, L_\text{cppl},$$

where $\lambda_\text{cppl} = 1$ by default.

Reconstruction Loss

The reconstruction term (Equation 1) operates on the few-shot set $\{I_i,\, P_i,\, T_i\}$:

$$L_\text{rec} = \mathbb{E}_{i,t,\epsilon}\, \bigl\| D_\theta(\alpha_t I_i + \sigma_t \epsilon;\; c_{t_i},\, c_{p_i}) - I_i \bigr\|_2^2,$$

where $D_\theta$ is the denoising UNet, $c_{t_i}$ and $c_{p_i}$ are text and pose embeddings, $t$ indexes a random diffusion time step, and $(\alpha_t, \sigma_t)$ denote the diffusion noise-schedule scalars.
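The reconstruction objective above can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: `denoiser` is a stand-in for the UNet $D_\theta$, and the expectation over $(i, t, \epsilon)$ is approximated with one random $(t, \epsilon)$ draw per image.

```python
import numpy as np

rng = np.random.default_rng(0)

def reconstruction_loss(denoiser, images, text_embs, pose_embs, alphas, sigmas):
    """Monte-Carlo estimate of L_rec over a few-shot batch.

    denoiser(noised, c_t, c_p) stands in for D_theta; alphas/sigmas are the
    noise-schedule scalars indexed by the sampled time step t.
    """
    total = 0.0
    for img, c_t, c_p in zip(images, text_embs, pose_embs):
        t = rng.integers(len(alphas))                 # random diffusion step t
        eps = rng.standard_normal(img.shape)          # epsilon ~ N(0, I)
        noised = alphas[t] * img + sigmas[t] * eps    # forward process sample
        pred = denoiser(noised, c_t, c_p)             # D_theta prediction
        total += np.mean((pred - img) ** 2)           # squared error vs. clean I_i
    return total / len(images)
```

As a quick sanity check, with a noiseless schedule (`alphas = [1.0]`, `sigmas = [0.0]`) an identity denoiser yields exactly zero loss.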

Condition Prior Preservation Loss

The CPPL term (Equation 2) utilizes pseudo-dataset samples $(I_{\text{pr}_i},\, P_{\text{pr}_i},\, T_{\text{pr}_i})$ generated via ancestral sampling from the frozen base model under random generic prompts and pose maps:

$$L_\text{cppl} = \mathbb{E}_{i,t,\epsilon}\, w'_t \bigl\| D_\theta(\alpha_t I_{\text{pr}_i} + \sigma_t \epsilon;\; c_{\text{prt}_i},\, c_{\text{prp}_i}) - I_{\text{pr}_i} \bigr\|_2^2.$$

Here, $I_{\text{pr}_i}$ is the output of the frozen base model given a sampled pose and text condition, and $w'_t$ is a time-dependent reweighting (typically matching the forward schedule). CPPL's explicit regularization constrains the fine-tuned model to match its pre-trained predecessor on the pseudo-examples, simultaneously anchoring both conditioning pathways.
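An analogous NumPy sketch for the CPPL term follows. All names are illustrative; in the actual pipeline the pseudo-images come from ancestral sampling of the frozen base model, which is assumed to have happened beforehand and is elided here.

```python
import numpy as np

rng = np.random.default_rng(0)

def cppl_loss(denoiser, pseudo_images, text_embs, pose_embs, alphas, sigmas, w):
    """Monte-Carlo estimate of L_cppl on frozen-model pseudo-samples.

    pseudo_images are assumed pre-generated by the frozen base model under
    generic prompts/poses; w[t] implements the time-dependent weight w'_t.
    """
    total = 0.0
    for img, c_t, c_p in zip(pseudo_images, text_embs, pose_embs):
        t = rng.integers(len(alphas))                 # random diffusion step t
        eps = rng.standard_normal(img.shape)          # epsilon ~ N(0, I)
        noised = alphas[t] * img + sigmas[t] * eps    # forward process sample
        pred = denoiser(noised, c_t, c_p)             # fine-tuned D_theta
        total += w[t] * np.mean((pred - img) ** 2)    # weighted anchor to I_pr_i
    return total / len(pseudo_images)
```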

2. Motivation and Intuition

Fine-tuning diffusion models with sparse supervision induces "memorization" and loss of control fidelity. Specifically, the model risks:

  • Collapsing to the limited set of training viewpoints or poses ("pose drift"),
  • Losing alignment to semantic prompts ("language drift").

DreamBooth-style prior-preservation strategies previously regularized language control alone. CPPL extends this prior matching to the entire text/pose conditioning tuple, preventing deviation of either control channel. Empirical evidence demonstrates that omitting CPPL results in overfitting: novel pose generations exhibit color shifts and restricted diversity, correlating with measurable degradation in identity and image-quality metrics (e.g., decreased CLIP-I body/head scores, lower PSNR, increased LPIPS) (Xi et al., 17 Nov 2025).

By enforcing that the finetuned network reconstructs the base model's outputs under generic prompts and poses, CPPL retains the ability to respond accurately to both arbitrary text descriptions and pose maps, even after strong personalization on limited data.

3. Role Within the Training Pipeline

CPPL is employed exclusively during the ControlBooth pose-aware fine-tuning phase. Each optimization iteration alternates between two objectives:

  1. Standard Reconstruction: Learn from real OOTD images, their pose maps (derived from ControlNet), and CLIP-captioned text descriptions.
  2. Prior Preservation: Simulate generic prompt/pose inference by generating pseudo-data via the frozen base model, then force the finetuned network to match the base model's response under the same conditioning.
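The two alternating objectives can be combined into a single per-iteration loss, sketched below in NumPy. All names are illustrative assumptions: `real` and `pseudo` are (image, text condition, pose condition) triples, and the pseudo image stands in for a frozen-model ancestral sample generated beforehand.

```python
import numpy as np

rng = np.random.default_rng(1)

def controlbooth_loss(denoiser, real, pseudo, alphas, sigmas, lam_cppl=1.0):
    """L_total^CB = L_rec + lam_cppl * L_cppl for one (real, pseudo) pair."""
    losses = []
    for img, c_t, c_p in (real, pseudo):
        t = rng.integers(len(alphas))                 # shared sampling recipe
        eps = rng.standard_normal(np.shape(img))
        noised = alphas[t] * np.asarray(img) + sigmas[t] * eps
        losses.append(np.mean((denoiser(noised, c_t, c_p) - img) ** 2))
    l_rec, l_cppl = losses
    return l_rec + lam_cppl * l_cppl                  # total ControlBooth loss
```

In the real pipeline this scalar would be minimized over the denoiser parameters each iteration; here the gradient step is omitted for brevity.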

An overview of the process:

| Objective | Inputs (sampled/generated) | Loss computed |
| --- | --- | --- |
| Real reconstruction | $(I, P, T)$ from the few-shot support set | $L_\text{rec}$ |
| Prior preservation | $(T_\text{pr}, P_\text{pr})$ plus $I_\text{pr}$ (generated by the frozen model) | $L_\text{cppl}$ |

The total loss $L_\text{total}^\text{CB}$ is minimized with respect to the denoiser parameters, ensuring preservation of generic control capabilities while adapting to a given individual's appearance via $L_\text{rec}$.

4. Implementation Details

  • Frameworks: The approach extends the HuggingFace Diffusers codebase, integrating an additional ControlNet branch for conditioning on pose.
  • Pose Encoding: The pose map $P_i$, a 2D skeleton heatmap, is encoded by a dedicated 4-layer convolutional network $F$, trained from scratch. Its output is injected into each UNet block via cross-attention.
  • Text Encoding: Prompts $T_i$ are generated using GPT-4V and encoded via a frozen CLIP text transformer $\Gamma$.
  • Optimization: Key hyperparameters include $\lambda_\text{cppl} = 1.0$, time-dependent weights $w'_t$ equal to the standard forward-diffusion weights, a batch size of 1–2 reflecting the few-shot regime, and learning rates of $5 \times 10^{-6}$ for UNet parameters and $5 \times 10^{-5}$ for the pose encoder $F$, determined via grid search. Fine-tuning runs for approximately 1,000 iterations (about 5 minutes on a single A100 GPU).
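As a rough illustration of the pose encoder's downsampling path, the snippet below traces feature-map side lengths through four strided convolutions. Kernel size 3, stride 2, and padding 1 are assumptions for illustration only; the source specifies just a 4-layer convolutional network $F$.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial output size of one strided convolution layer."""
    return (size + 2 * padding - kernel) // stride + 1

def pose_encoder_sizes(input_size, num_layers=4):
    """Trace feature-map side lengths through an assumed 4-layer encoder F."""
    sizes = [input_size]
    for _ in range(num_layers):
        sizes.append(conv_out(sizes[-1]))
    return sizes
```

Under these assumptions, a 512x512 pose heatmap yields side lengths 512 → 256 → 128 → 64 → 32 before the features are injected into the UNet blocks via cross-attention.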

5. Empirical Performance and Analysis

Ablation studies and qualitative figures in (Xi et al., 17 Nov 2025) demonstrate that when CPPL is excluded, models exhibit overfitting to training poses (e.g., consistent snapping to memorized angles and color shifts). Including CPPL restores correct pose controllability and color consistency across novel camera views and conditions.

Quantitatively, removing CPPL reduces CLIP-I body/head metrics by approximately 5–8 points, reduces PSNR by over 2 dB, and increases LPIPS by more than 0.02 on the PuzzleIOI benchmark. User studies reported a 25-point preference gap for CPPL-trained models compared to those trained without the loss.
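For reference, the PSNR figures above relate to pixel-space MSE via the standard definition (this is textbook material, not code from the paper):

```python
import numpy as np

def psnr(reference, test, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, max_val]."""
    mse = np.mean((np.asarray(reference) - np.asarray(test)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```

A uniform error of 0.1 on a unit-range image gives an MSE of 0.01 and thus a PSNR of 20 dB; a 2 dB drop therefore corresponds to roughly a 1.6x increase in MSE.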

6. Limitations and Future Directions

CPPL assumes the frozen base diffusion model exhibits reliable performance for both text and pose priors. If the unadapted model is poorly calibrated along either axis, CPPL may reinforce those weaknesses. In extremely low-shot scenarios (1–2 images), balancing $L_\text{rec}$ against $L_\text{cppl}$ becomes critical, suggesting adaptive schedules for $\lambda_\text{cppl}$ as a future avenue. CPPL currently operates on pixel-space MSE; integrating perceptually better-aligned losses (e.g., CLIP-based or latent-space metrics) may yield sharper outputs. Extending CPPL to multi-modal conditioning (e.g., depth, normals, segmentation) is a natural progression toward more semantically and geometrically robust avatar generation.

In summary, CPPL provides a principled solution to multi-condition prior preservation, simultaneously anchoring language and pose pathways to the base model, thereby mitigating drift in few-shot finetuning and enabling state-of-the-art 3D avatar personalization for real-world imagery (Xi et al., 17 Nov 2025).
