Identity-Preserving Score Distillation

Updated 13 April 2026

The paper introduces a framework where pretrained diffusion models guide generation and editing while explicitly preserving key identity features in 2D/3D representations.
It details methods like Adaptive Human Distillation Sampling and identity-embedding losses that optimize facial, geometric, and stylistic fidelity using robust regularization and multi-view consistency.
The approach extends to data-free, few-step generative distillation and specialized editing (e.g., LUSD, IDS) to balance prompt fidelity with high-quality identity retention.

Identity-preserving score distillation encompasses a class of approaches for generative modeling, editing, and distillation where the aim is to leverage pretrained diffusion models as priors to guide new models (e.g., 3D humans, stylized images, efficient samplers, few-step generators) while explicitly retaining, reconstructing, or embedding the essential “identity” characteristics of the subject or the original distribution. This principle manifests across domains: in 2D/3D representation, identity cues might be facial structure, pose, or geometry; in data-free distillation, “identity” refers to preservation of the teacher’s score field and sample support. Recent research in identity-preserving score distillation spans specialized techniques for human-centric synthesis, robust 2D/3D editing, stylization, and highly efficient generative model distillation, with rigorous attention to metrics quantifying both prompt fidelity and identity retention.

1. Foundations of Score Distillation and Identity Losses

Score distillation sampling (SDS) forms the basis for most identity-preserving methods. In SDS, the parameters $\theta$ of a differentiable renderer $g(\theta)$ (a 3D model or an image) are optimized so that the noise a pretrained diffusion model predicts for a noised render agrees with the noise that was actually injected. The canonical SDS gradient with respect to $\theta$ is:

$\nabla_\theta L_{\mathrm{SDS}}(\theta) = \mathbb{E}_{t, \epsilon} [w(t) (\epsilon_\phi(x_t; y, t) - \epsilon)^\top \nabla_\theta x_t(\theta)]$

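where $x = g(\theta)$ is the current render, $x_t = \sqrt{\alpha_t} x + \sqrt{1-\alpha_t} \epsilon$ is its noised version with $\epsilon \sim \mathcal{N}(0, I)$, $\epsilon_\phi$ is the pretrained noise predictor, $y$ is a conditioning prompt, and $w(t)$ is a timestep-dependent weight.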
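As a rough illustration, the SDS update can be sketched on a one-dimensional toy problem. Everything here is an assumption made for the example: the renderer is the identity ($x = \theta$), the "pretrained model" is the exact noise predictor for a Gaussian prior $\mathcal{N}(\mu, s^2)$, and the names `MU`, `S`, `eps_model`, `sds_grad`, and `optimize` are hypothetical. The sketch uses the usual descent convention, in which the residual is the predicted noise minus the injected noise.

```python
import math
import random

MU, S = 2.0, 0.1  # hypothetical toy prior: clean data ~ N(MU, S^2)

def eps_model(x_t, alpha):
    """Exact noise prediction E[eps | x_t] for the Gaussian toy prior.

    Stands in for the pretrained predictor eps_phi; no network is involved.
    """
    var_t = alpha * S ** 2 + (1.0 - alpha)  # variance of x_t under the prior
    return math.sqrt(1.0 - alpha) * (x_t - math.sqrt(alpha) * MU) / var_t

def sds_grad(theta, rng, n_samples=16):
    """Monte Carlo estimate of the SDS gradient for the identity renderer x = theta."""
    total = 0.0
    for _ in range(n_samples):
        alpha = rng.uniform(0.02, 0.98)      # random diffusion stage
        eps = rng.gauss(0.0, 1.0)            # injected noise
        x_t = math.sqrt(alpha) * theta + math.sqrt(1.0 - alpha) * eps
        w = 1.0 - alpha                      # assumed weighting w(t) = 1 - alpha_t
        dxt_dtheta = math.sqrt(alpha)        # d x_t / d theta when x = theta
        total += w * (eps_model(x_t, alpha) - eps) * dxt_dtheta
    return total / n_samples

def optimize(theta=0.0, steps=500, lr=0.1, seed=0):
    """Plain gradient descent on theta using the SDS gradient estimate."""
    rng = random.Random(seed)
    for _ in range(steps):
        theta -= lr * sds_grad(theta, rng)
    return theta

theta_final = optimize()
```

In this toy setting the optimized parameter is pulled toward the mode of the prior, which is the mode-seeking behavior usually attributed to SDS. A real pipeline would replace `eps_model` with a text-conditioned diffusion network and the identity renderer with a differentiable 3D or image renderer.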

However, SDS by itself frequently induces identity drift—alterations not controlled by the prompt, due to biases or noise in the gradient field. Consequently, methods introduce explicit identity-aware losses, variational terms, or regularization to ensure that generated or edited samples preserve semantically salient identity attributes.

For example, in “GaussianIP,” an identity-aware score $\delta_{\mathrm{ip}}(x_t)$ is constructed via a face-focused diffusion prior conditioned on both text and an identity image, and an optional identity-embedding loss compares ArcFace embeddings of the identity image and of rendered face crops (Tang et al., 14 Mar 2025). In editing scenarios (LUSD, IDS), identity is enforced by spatial regularization or by a fixed-point constraint on the diffusion posterior (e.g., matching Tweedie’s posterior mean of noisy renders to the original) (Chinchuthakun et al., 14 Mar 2025, Kim et al., 27 Feb 2025). In mask-free 3D and image editing, variational score matching (Piva) matches the derived Fisher scores of the original and edited distributions, using LoRA-adapted U-Nets for tractable approximation (Le et al., 2024).

2. Advanced Human-Centric Methods: GaussianIP

“GaussianIP” introduces a two-stage architecture for identity-preserving 3D human generation from text and image prompts (Tang et al., 14 Mar 2025):

Adaptive Human Distillation Sampling (AHDS): This stage replaces the generic SDS score with an identity-aware score $\delta_{\mathrm{ip}}(x_t) = \epsilon_{\mathrm{ip}}^{s}(x_t; y, I_{\mathrm{ip}}, t) - \epsilon$ , where $\epsilon_{\mathrm{ip}}^{s}$ is a dedicated face-centric diffusion UNet. The model is conditioned jointly on text and a portrait image, and selectively weights region-specific terms (face/body). AHDS further leverages phase-aware timestep scheduling, partitioning optimization into geometry/base texture, mid-level, and fine-detail phases using a dual-Gaussian timestep schedule tailored to efficiently progress through the learning curriculum.
Identity-Embedding Loss: A 2D loss regularizes the distance between ArcFace features of rendered faces and the source portrait, ramped up during warm-up steps to avoid disrupting coarse structure.
View-Consistent Refinement (VCR): To remedy detail blurring across views, VCR iteratively refines a set of multi-view frames using cross-view mutual attention (injecting main-view attention into key-views) and distance-guided fusion for intermediate views. Consistency is locked in via a final image-to-3D reconstruction loss combining $\ell_1$ and LPIPS metrics.

Empirically, GaussianIP reports an identity match of at least 83% (Face++) together with strong ArcFace cosine similarity, reduces optimization steps by roughly 30% relative to standard SDS, and substantially outperforms baselines in user-study scores for identity, texture fidelity, and detail preservation.
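The identity-embedding loss with warm-up described above can be sketched as follows. This is a minimal illustration rather than the GaussianIP implementation: the cosine distance, the linear ramp, the default values, and the function names are all assumptions, and the embeddings are plain lists standing in for the output of a face-recognition network such as ArcFace.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def warmup_weight(step, warmup_steps, max_weight):
    """Linearly ramp the loss weight from 0 to max_weight over warmup_steps."""
    if warmup_steps <= 0:
        return max_weight
    return max_weight * min(1.0, step / warmup_steps)

def identity_loss(render_emb, source_emb, step, warmup_steps=1000, max_weight=0.1):
    """Weighted embedding distance between a rendered face and the source portrait.

    The weight starts at zero so that early optimization of coarse structure
    is not disturbed, then grows to max_weight (assumed hyperparameters).
    """
    weight = warmup_weight(step, warmup_steps, max_weight)
    return weight * cosine_distance(render_emb, source_emb)
```

With these assumed defaults, the term contributes nothing at step 0 and reaches full strength after 1000 steps; in practice the total objective would add it to the distillation loss at each iteration.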

3. Score Distillation in 3D Stylization and Multi-View Editing

“Identity Preserving 3D Head Stylization with Multiview Score Distillation” (Bilecen et al., 2024) adapts score distillation to 3D stylization tasks. The approach optimizes a pre-trained 3D GAN (PanoHead) for both stylization (from a diffusion teacher) and identity coherence via:

Negative Log-Likelihood Distillation: Marginal likelihoods of renders from multiple camera poses are maximized under the teacher diffusion model; back-propagation through the noising process aligns the generator’s outputs with the stylized target distribution.
Multi-View Grid Score and Mirror Gradients: Simultaneous tile-based multi-view constraints and reflectional symmetry augment the gradient, ensuring identity cues remain consistent across all 360° views.
Score Rank Weighing: SVD-based rank reweighting suppresses undesirable color shifts by emphasizing dominant stylistic directions in the score tensor.

This pipeline preserves high ArcFace identity similarity and achieves low FID and multi-view geometric consistency, all without auxiliary perceptual or explicit identity losses.
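One plausible reading of the grid and mirror components is sketched below using nested lists as stand-in images. The helper names, the row-major tiling, and the equal-weight averaging are assumptions for illustration and are not taken from the paper. The only firm fact used is the chain rule for a horizontal flip: a gradient computed on a flipped image maps back to the original image by flipping it again.

```python
def tile_grid(views, cols):
    """Tile equally sized H x W views (lists of rows) into one grid image, row-major."""
    if not views or len(views) % cols != 0:
        raise ValueError("number of views must be a positive multiple of cols")
    grid = []
    for start in range(0, len(views), cols):
        band = views[start:start + cols]       # views forming one row of tiles
        for r in range(len(band[0])):
            row = []
            for view in band:
                row.extend(view[r])            # concatenate pixel rows side by side
            grid.append(row)
    return grid

def mirror(view):
    """Horizontal flip of a single view."""
    return [list(reversed(row)) for row in view]

def symmetrized_gradient(grad, grad_on_mirrored):
    """Average a view's gradient with the gradient obtained on its mirrored copy.

    The mirrored-copy gradient is flipped back first, which is the chain rule
    for the flip operation.
    """
    flipped_back = mirror(grad_on_mirrored)
    return [[0.5 * (a + b) for a, b in zip(ra, rb)]
            for ra, rb in zip(grad, flipped_back)]
```

In an actual pipeline the tiled grid would be noised and scored by the diffusion teacher as a single image, so that all views receive a jointly consistent gradient.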

4. Data-Free Distillation and Efficient Generation: SiD and Extensions

Score identity Distillation (SiD) advances data-free, one-step or few-step distillation of large diffusion and flow-matching models (Zhou et al., 2024, Zhou et al., 2024, Zhou et al., 19 May 2025, Zhou et al., 29 Sep 2025). The core mechanism is the minimization of Fisher divergence between the teacher’s and student’s (generator’s) score fields:

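$L_{\mathrm{Fisher}}(\theta) = \mathbb{E}_{t,\, x_t \sim p_\theta(x_t)} [\| s_\phi(x_t, t) - \nabla_{x_t} \log p_\theta(x_t) \|_2^2]$

where $s_\phi(x_t, t)$ is the teacher’s score, $p_\theta(x_t)$ is the distribution of noised generator outputs, and the student score $\nabla_{x_t} \log p_\theta(x_t)$ is tied to generator outputs via Tweedie’s formula and conditional expectations.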
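To make the Fisher divergence concrete, the sketch below evaluates it for two one-dimensional Gaussians at a single noise level, where both scores are available in closed form. This is only an illustration under stated assumptions: SiD itself works with neural score networks and noised generator samples, and the function names and the `(mean, variance)` tuple convention are hypothetical.

```python
import random

def gaussian_score(x, mean, var):
    """Score d/dx log N(x; mean, var) of a 1-D Gaussian."""
    return -(x - mean) / var

def fisher_divergence_mc(student, teacher, n=20000, seed=0):
    """Monte Carlo estimate of E_{x ~ student}[(s_teacher(x) - s_student(x))^2].

    student and teacher are (mean, variance) pairs; samples come from the
    student, mirroring the fact that only generated samples are used.
    """
    rng = random.Random(seed)
    (m_s, v_s), (m_t, v_t) = student, teacher
    total = 0.0
    for _ in range(n):
        x = rng.gauss(m_s, v_s ** 0.5)
        diff = gaussian_score(x, m_t, v_t) - gaussian_score(x, m_s, v_s)
        total += diff * diff
    return total / n

def fisher_divergence_exact(student, teacher):
    """Closed form for Gaussians: v_s (1/v_s - 1/v_t)^2 + (m_s - m_t)^2 / v_t^2."""
    (m_s, v_s), (m_t, v_t) = student, teacher
    return v_s * (1.0 / v_s - 1.0 / v_t) ** 2 + (m_s - m_t) ** 2 / v_t ** 2
```

The divergence vanishes exactly when the two Gaussians coincide, which is the sense in which minimizing it aligns the student's score field with the teacher's.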

SiD introduces three theoretical identities, leveraging semi-implicit distributions to avoid real data, and uses only synthesized generator images for both loss evaluation and score alignment. The generator $G_\theta$ maps noise $z$ directly to a sample $x_g = G_\theta(z)$, bypassing iterative denoising. In practice, SiD’s bias-corrected loss and alternating optimization of a “fake” score network $f_\psi$ produce highly efficient models: single-step sampling matches or surpasses the teacher’s FID on several benchmarks, including CIFAR-10 and ImageNet 64×64 (Zhou et al., 2024).

Adversarial Score identity Distillation (SiDA): SiDA augments SiD with an adversarial term tied to real data, with the generator encoder serving as the discriminator. This further improves efficiency and FID on ImageNet 64×64 and accelerates convergence by an order of magnitude (Zhou et al., 2024).

Few-Step and Flow Matching Extensions: SiD extends naturally to few-step generation, matching a uniform mixture of outputs at all intermediate steps (with proven optimality). It also applies to flow-matching generators (e.g., SANA, SD3, FLUX families) via the same Fisher-divergence-based loss and preserves “identity” in the sense of empirical sample and score alignment (Zhou et al., 19 May 2025, Zhou et al., 29 Sep 2025).

5. Specialized Editing: Mask-Free and Fixed-Point Regularized Techniques

LUSD (Localized Update Score Distillation) (Chinchuthakun et al., 14 Mar 2025): For text-driven image editing, LUSD introduces spatially modulated score distillation, focusing updates within regions targeted by prompt-specific cross- and self-attention maps extracted from the diffusion UNet. Gradients are filtered by their spatial standard deviation and normalized to ensure robust object insertion and background/identity preservation, outperforming other SDS variants in both user preference and prompt fidelity metrics.
IDS (Identity-preserving Distillation Sampling by Fixed-Point Iterator) (Kim et al., 27 Feb 2025): IDS compensates for SDS-induced identity drift by incorporating fixed-point iterative regularization. At each SDS step, the noisy latent is iteratively updated so the model’s posterior mean under the text-conditioned score approaches the known source image, thus correcting misaligned gradients and enforcing a local fixed-point property:

$x_t \leftarrow x_t - \lambda \nabla_{x_t} L_{\mathrm{FPR}}(x_t), \quad L_{\mathrm{FPR}}(x_t) = \| x_{\mathrm{src}} - \hat{x}_0(x_t) \|_2^2, \quad \hat{x}_0(x_t) = (x_t - \sqrt{1-\alpha_t}\, \epsilon_\phi(x_t; y, t)) / \sqrt{\alpha_t}$

where $L_{\mathrm{FPR}}$ is a regularizer driving the posterior mean $\hat{x}_0(x_t)$ to match the source $x_{\mathrm{src}}$, and $\lambda$ is a step size. IDS thus yields higher IoU, background PSNR, and perceptual similarity in both 2D and 3D editing than classical Delta Denoising Score (DDS) or contrastive SDS.
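The fixed-point idea can be illustrated with a scalar toy example: a noisy latent is nudged by gradient descent until the Tweedie-style posterior mean computed from it reproduces the source value. The sketch is not the IDS implementation. The noise predictor `toy_eps` is a hypothetical smooth function rather than a trained network, the gradient is taken by central finite differences instead of automatic differentiation, and the step size and iteration count are arbitrary choices.

```python
import math

def posterior_mean(x_t, alpha, eps_model):
    """Tweedie-style estimate of the clean sample from a noisy latent x_t."""
    return (x_t - math.sqrt(1.0 - alpha) * eps_model(x_t)) / math.sqrt(alpha)

def fixed_point_refine(x_t, x_src, alpha, eps_model, n_iters=50, lr=0.2, h=1e-5):
    """Gradient descent on (x_src - posterior_mean(x_t))^2 with respect to x_t."""
    def loss(z):
        return (x_src - posterior_mean(z, alpha, eps_model)) ** 2

    for _ in range(n_iters):
        grad = (loss(x_t + h) - loss(x_t - h)) / (2.0 * h)  # central difference
        x_t -= lr * grad
    return x_t

def toy_eps(x_t):
    """Hypothetical noise predictor used only for this illustration."""
    return 0.3 * x_t + 0.1 * math.tanh(x_t)

alpha = 0.5
x_src = 1.0
# Noisy latent built from the source with an arbitrary noise draw of 0.8.
x_t0 = math.sqrt(alpha) * x_src + math.sqrt(1.0 - alpha) * 0.8
x_t_refined = fixed_point_refine(x_t0, x_src, alpha, toy_eps)
```

Before refinement the posterior mean of `x_t0` is visibly off the source; after refinement it agrees with `x_src` to high precision, which is the local fixed-point property the regularizer is meant to enforce before the distillation gradient is computed.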

Piva (Preserving Identity with Variational Score) (Le et al., 2024): Piva introduces a variational score-matching term based on a surrogate KL between the NeRF outputs pre- and post-edit