Stable-Hair v2: Multi-View Hair Transfer

Updated 4 July 2026

Stable-Hair v2 is a diffusion-based multi-view framework that transfers hairstyles onto portraits while preserving identity and background consistency.
It employs a three-stage training strategy with a pose-controllable Latent IdentityNet, a hair extractor using cross-attention, and temporal attention modules for improved inter-view coherence.
Its multi-view data pipeline generates aligned source–bald pairs using a diffusion-based bald converter and data-augmented inpainting to minimize color drift and artifacts.

Stable-Hair v2 is a diffusion-based framework for multi-view hairstyle transfer that takes one source subject and a reference hairstyle, then synthesizes photorealistic outputs across many viewpoints while preserving identity and ensuring cross-view consistency. It extends the single-view Stable-Hair formulation by introducing a multi-view training data generation pipeline, a multi-view diffusion model with pose conditioning and temporal attention, and a three-stage training strategy designed to disentangle view control, hair synthesis, and inter-view coherence (Sun et al., 10 Jul 2025). The system is positioned as a successor to "Stable-Hair: Real-World Hair Transfer via Diffusion Model" (Zhang et al., 2024), which established the two-stage single-view paradigm based on a Bald Converter, Hair Extractor, Hair Cross-Attention, and a latent-space identity-preservation module.

1. Scope, lineage, and task definition

Stable-Hair v2 addresses the task of transferring the hair attributes of a reference image, including color, length, curvature, and structure, onto a source portrait while preserving the source identity and background across multiple camera viewpoints (Sun et al., 10 Jul 2025). The framework is explicitly motivated by digital humans and avatars, for which inconsistencies across views produce flicker, shape drift, and color drift.

The earlier Stable-Hair system formulated hair transfer as a two-stage latent diffusion pipeline. In Stage 1, a Bald Converter removed hair from the source face to produce a bald proxy. In Stage 2, a Hair Extractor encoded reference hairstyle features, Hair Cross-Attention injected those features into a Stable Diffusion U-Net, and a Latent IdentityNet preserved identity and background (Zhang et al., 2024). Stable-Hair v2 preserves that latent-space philosophy, but generalizes it to multi-view generation by adding explicit pose conditioning and sequence-aware attention.

A common misconception is to treat Stable-Hair v2 as a 3D reconstruction method. The published description does not present it as a NeRF- or Gaussian-based system. Instead, it generates consistent multi-view images directly without building an explicit 3D representation; this yields faster asset creation and fewer assumptions, but also means that exact 3D consistency is harder to enforce for extreme views or backgrounds (Sun et al., 10 Jul 2025).

2. Multi-view data generation pipeline

A central component of Stable-Hair v2 is its multi-view training data generation pipeline, which produces multi-view triplets and aligned source–bald pairs (Sun et al., 10 Jul 2025). The pipeline contains three modules.

First, a diffusion-based Bald Converter transforms the source portrait $I_s$ into a bald proxy $I_b$. This module inherits the core role of the Bald Converter in Stable-Hair v1: it standardizes the source input by removing conflicting hair content while preserving identity and background as much as possible (Zhang et al., 2024). In v2, the latent-space design is retained because pixel-space ControlNet cascades were associated with color drift, whereas the latent formulation improved CLIP-I, FID, PSNR, SSIM, and IDS in the reported ablation (Sun et al., 10 Jul 2025).

Second, a data-augmented inpainting model generates reference hairstyle images $I_r$. The stated motivation is that inpainting may leak background or identity information into $I_r$, which can mislead the hair encoder. To reduce this effect, the method generates approximately $100$ diverse backgrounds via Stable Diffusion prompts, composites the face-plus-hair region of $I_s$ with those backgrounds, inpaints non-hair regions using Stable Diffusion guided by prompts from ChatGPT, and samples $10$ augmented references per source during training (Sun et al., 10 Jul 2025).
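The sampling side of this augmentation can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `sample_reference_backgrounds`, the subject identifier, and the choice to sample backgrounds without replacement are all hypothetical, and the compositing and inpainting steps themselves are not modeled.

```python
import random

NUM_BACKGROUNDS = 100   # approximate pool size stated in the text
REFS_PER_SOURCE = 10    # augmented references sampled per source

def sample_reference_backgrounds(source_id, rng):
    """Pick background indices for one source portrait.

    Hypothetical helper: sampling without replacement is an assumption,
    and compositing/inpainting are out of scope for this sketch.
    """
    chosen = rng.sample(range(NUM_BACKGROUNDS), REFS_PER_SOURCE)
    return [(source_id, bg) for bg in chosen]

rng = random.Random(0)
pairs = sample_reference_backgrounds("subject_0001", rng)
```

Each returned pair names one source and one background, which would then be composited and inpainted to form a single augmented reference.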

Third, a face-finetuned multi-view diffusion model generates aligned multi-view source–bald pairs $\{(I_s^i, I_b^i)\}_{i=1}^K$. The paper states that an SV3D model is fine-tuned on approximately $20\text{k}$ multi-view face videos with $21$ viewpoints, synthesized using a state-of-the-art 3D-aware GAN, and then used to generate the aligned pairs required for pose-controllable training (Sun et al., 10 Jul 2025).

This pipeline suggests that Stable-Hair v2 treats data construction not as an auxiliary preprocessing step but as part of the method’s core technical contribution. A plausible implication is that multi-view consistency depends not only on model architecture but also on the availability of aligned bald/source supervision under controlled pose variation.

3. Architecture and conditioning mechanisms

The multi-view hair transfer model is built around three major components: a Latent IdentityNet, a Hair Extractor, and temporal attention layers inserted into the diffusion U-Net (Sun et al., 10 Jul 2025). The backbone is Stable Diffusion v1.5 with a VAE encoder–decoder, and inference uses DDIM sampling with classifier-free guidance.

The Latent IdentityNet is a pose-controllable module that operates in latent space and preserves identity, geometry, and background across viewpoints. During Stage 1 training, it takes a bald image at one view and predicts the source image at another view given pose parameters. This continues the latent-space identity-preservation role played by the Latent IdentityNet in Stable-Hair v1, where the module conditioned Stage 2 on the bald proxy to minimize drift in non-hair regions (Zhang et al., 2024).

The Hair Extractor is a reference encoder/U-Net that extracts hairstyle features from the reference image via self-attention layers and injects them into the main U-Net through hair cross-attention. The conditioning pattern follows the Stable-Hair v1 design: the reference-derived features serve as keys and values, and the main U-Net activations serve as queries, allowing fine-grained hairstyle details to be transferred into the denoising trajectory (Zhang et al., 2024).
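The query/key/value roles described above can be made concrete with a minimal scaled dot-product attention sketch. It is not the paper's implementation: the learned projection matrices, multi-head structure, and residual connection are omitted, and the toy token values are assumptions chosen only to make the behavior visible.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention.

    queries: tokens from the main U-Net activations
    keys, values: tokens derived from the reference hairstyle features
    Learned projections are omitted in this sketch (an assumption).
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy example: three main-branch tokens attend over two reference tokens.
main_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hair_keys = [[1.0, 0.0], [0.0, 1.0]]
hair_values = [[10.0, 0.0], [0.0, 10.0]]
attended = cross_attention(main_tokens, hair_keys, hair_values)
```

Each output row is a convex combination of the reference values, weighted by how strongly the corresponding main-branch token matches each reference key.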

Stable-Hair v2 adds temporal attention layers after each U-Net block. These layers operate along the view dimension rather than the spatial dimension, aggregating features across adjacent frames or viewpoints to reduce temporal artifacts and enforce view-consistent hair structure (Sun et al., 10 Jul 2025). The paper explicitly relates this design to SV3D and AnimateDiff.
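The distinction between attending along the view dimension and attending along the spatial dimension can be illustrated as follows. This is a simplified sketch, not the published layer: it uses plain self-attention without learned projections, attends over all views rather than a window of adjacent ones, and the `features[v][p]` layout is an assumed convention.

```python
import math

def attend(tokens):
    """Self-attention over a list of feature vectors (no learned projections)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi / z * t[j] for wi, t in zip(w, tokens)) for j in range(d)])
    return out

def view_attention(features):
    """features[v][p] is the feature vector of view v at spatial position p.

    For each fixed position p, information is mixed across views v,
    so spatial positions never attend to each other in this layer.
    """
    num_views, num_pos = len(features), len(features[0])
    result = [[None] * num_pos for _ in range(num_views)]
    for p in range(num_pos):
        mixed = attend([features[v][p] for v in range(num_views)])
        for v in range(num_views):
            result[v][p] = mixed[v]
    return result

# Three views, two spatial positions, two channels.
features = [
    [[1.0, 2.0], [0.0, 0.0]],
    [[1.0, 2.0], [1.0, 0.0]],
    [[1.0, 2.0], [2.0, 0.0]],
]
mixed = view_attention(features)
```

Position 0 is identical in every view, so it passes through unchanged, while position 1 differs across views and is pulled toward a shared value.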

Pose is represented by polar and azimuth angles, together with an auxiliary term described as noise augmentation of camera pose. The fused conditioning is built from these quantities using a sinusoidal embedding and concatenation (Sun et al., 10 Jul 2025). This fused embedding is injected through the usual Stable Diffusion time-embedding pathway.
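A sinusoidal embedding of two camera angles followed by concatenation might look like the sketch below. The embedding width, the frequency schedule, the sin-then-cos ordering, and the function names are all assumptions; the noise-augmentation term and any learned projection before the time-embedding pathway are left out.

```python
import math

def sinusoidal_embedding(value, dim=8, max_period=10000.0):
    """Standard sin/cos embedding of a scalar; dim must be even."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return ([math.sin(value * f) for f in freqs]
            + [math.cos(value * f) for f in freqs])

def pose_embedding(polar_rad, azimuth_rad, dim=8):
    """Concatenate the embeddings of the two camera angles."""
    return (sinusoidal_embedding(polar_rad, dim)
            + sinusoidal_embedding(azimuth_rad, dim))

emb = pose_embedding(math.radians(90.0), math.radians(30.0))
```

In a full model, a vector like `emb` would typically be projected to the width of the timestep embedding before being combined with it.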

4. Training strategy and optimization

Stable-Hair v2 uses a multi-stage training strategy consisting of pose-controllable Latent IdentityNet training, Hair Extractor training, and temporal attention training (Sun et al., 10 Jul 2025). The ordering is deliberate.

In Stage 1, the model trains the pose-controllable Latent IdentityNet on multi-view bald images and corresponding target views. The objective is to learn view control and identity preservation without the additional complexity of hairstyle synthesis. The conditioning includes the bald latent and the pose embedding, while the reference conditioning is absent.

In Stage 2, the Hair Extractor and hair cross-attention parameters are trained. The paper states that one bald image $I_b^i$ and a random reference image $I_r$ are used to reconstruct the corresponding target image $I_s^i$ at the same view, while IdentityNet is frozen (Sun et al., 10 Jul 2025). This isolates hairstyle transfer from pose disentanglement.

In Stage 3, the temporal attention modules alone are trained on randomly sampled multi-frame chunks, with IdentityNet and Hair Extractor frozen. The goal is to improve inter-view coherence rather than per-view photorealism.
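The three-stage schedule can be summarized as a small lookup of which module groups receive gradients in each stage. The module names are hypothetical labels for the components named in the text, not identifiers from the released code, and the table records only what the description states explicitly.

```python
# Which module groups are trained or frozen in each stage (labels are hypothetical).
STAGES = {
    1: {"trainable": {"latent_identitynet"},
        "frozen": set()},
    2: {"trainable": {"hair_extractor", "hair_cross_attention"},
        "frozen": {"latent_identitynet"}},
    3: {"trainable": {"temporal_attention"},
        "frozen": {"latent_identitynet", "hair_extractor"}},
}

def requires_grad(module_name, stage):
    """Return True if the named module group is optimized in the given stage."""
    return module_name in STAGES[stage]["trainable"]

trainable_per_stage = {s: sorted(cfg["trainable"]) for s, cfg in STAGES.items()}
```

In a training script, a check like `requires_grad(name, stage)` would decide which parameters are handed to the optimizer, so each stage touches only the capability it is meant to learn.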

The optimization objective throughout is the standard latent diffusion noise-prediction loss:

$$\mathcal{L} = \mathbb{E}_{z_0,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[\left\lVert \epsilon - \epsilon_\theta(z_t, t, c_s, c_r, c_v) \right\rVert_2^2\right]$$

where $z_t$ is the noised latent at timestep $t$, $c_s$ is the source condition, $c_r$ is the reference condition, and $c_v$ is the viewpoint condition (Sun et al., 10 Jul 2025). The paper does not indicate using auxiliary identity, hairstyle, or temporal consistency losses; those effects are attributed to the architecture and the curriculum.
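The noise-prediction objective can be illustrated numerically with a toy example. Everything here is a stand-in: the latent is a short list of floats, the noise level is an arbitrary assumed value, and `oracle_predictor` replaces the conditioned U-Net with a function that recovers the noise exactly, purely to show that the loss vanishes when the prediction is right.

```python
import math
import random

def add_noise(z0, eps, alpha_bar_t):
    """Forward diffusion: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]

def noise_prediction_loss(eps, eps_pred):
    """Mean squared error between true and predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)

def oracle_predictor(z_t, z0, alpha_bar_t):
    """Stand-in for the conditioned U-Net: recovers eps exactly using z0."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(zt - a * z) / b for zt, z in zip(z_t, z0)]

rng = random.Random(0)
z0 = [rng.gauss(0.0, 1.0) for _ in range(16)]    # toy clean latent
eps = [rng.gauss(0.0, 1.0) for _ in range(16)]   # sampled Gaussian noise
z_t = add_noise(z0, eps, alpha_bar_t=0.5)        # assumed noise level

loss_perfect = noise_prediction_loss(eps, oracle_predictor(z_t, z0, 0.5))
loss_zero_guess = noise_prediction_loss(eps, [0.0] * 16)
```

A real model has no access to the clean latent and must infer the noise from the noised latent, the timestep, and the conditioning signals, which is what training against this loss encourages.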

The reported implementation details are specific to each stage. All three stages are trained on H800 GPUs, each with its own batch size, learning rate, and number of steps, and Stage 3 additionally fixes the sequence length (Sun et al., 10 Jul 2025).

5. Evaluation and empirical results

The evaluation covers both single-view and multi-view settings, with comparisons against Barbershop, SYH, HairCLIP, HairCLIPV2, HairFastGAN, HairFusion, Stable-Hair v1, and multi-view pipelines formed by combining single-view hair transfer methods with SV3D (Sun et al., 10 Jul 2025). The metrics reported are FID, PSNR, SSIM, IDS with InsightFace, and CLIP-I. For multi-view evaluation, the paper also uses head-motion alignment via 3DMM parameters.
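Of the listed metrics, PSNR is simple enough to compute by hand, and a minimal sketch helps interpret the reported values. The 8-bit pixel range and the flat-list image representation are assumptions; the paper's exact evaluation protocol (resolution, masking, color space) is not specified here.

```python
import math

def psnr(img_a, img_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists.

    max_value=255.0 assumes 8-bit images (an assumption of this sketch).
    """
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

reference = [100.0, 100.0, 100.0, 100.0]
generated = [110.0, 90.0, 110.0, 90.0]
score = psnr(reference, generated)
```

Higher PSNR means a smaller mean squared pixel error, so a per-pixel deviation of about 10 intensity levels on an 8-bit scale already lands in the high twenties of decibels.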

The main quantitative results are summarized below.

Setting	Metric	Stable-Hair v2
Single-view	CLIP-I	0.431
Single-view	FID	35.125
Single-view	PSNR	30.671
Single-view	SSIM	0.673
Single-view	IDS	0.778
Multi-view	CLIP-I	0.411
Multi-view	FID	32.170
Multi-view	PSNR	26.347
Multi-view	SSIM	0.490
Multi-view	IDS	0.683

In the single-view comparison, Stable-Hair v2 reports the highest CLIP-I and the best FID, PSNR, SSIM, and IDS among the listed baselines (Sun et al., 10 Jul 2025). In the multi-view comparison, it reports the best or near-best CLIP-I, an FID of $32.170$, the second-best PSNR ($26.347$) and SSIM ($0.490$), and the best IDS ($0.683$).

The ablation study isolates several design choices. The latent version of the control module outperforms the pixel-space version on CLIP-I, FID, PSNR, SSIM, and IDS (Sun et al., 10 Jul 2025). Temporal attention improves reconstructed-sequence coherence as measured by PSNR, SSIM, and LPIPS (Sun et al., 10 Jul 2025). The qualitative analysis attributes cleaner hair boundaries to data augmentation and better pose control to the stage-wise training schedule.

User studies also favor the method. The paper reports that v2 ranked best in Accuracy, Preservation, and Naturalness in the single-view study, and best in multi-view Accuracy, Smoothness, and Naturalness in the multi-view study (Sun et al., 10 Jul 2025).

6. Limitations, reproducibility, and relation to adjacent methods

The paper identifies several limitations. Background reconstruction exhibits less robust temporal coherence for complex backgrounds, and artifacts may appear between views if background features are inconsistent across viewpoints. Training emphasizes front-facing views, so extreme angles may degrade performance. Extreme occlusions, accessories, and rare hairstyles remain difficult cases (Sun et al., 10 Jul 2025). These limitations are consistent with the system’s image-based multi-view formulation rather than an explicit geometric representation.

Stable-Hair v2 is reproducible in a stronger sense than the original Stable-Hair paper. The original work linked to a project page, but public training or inference code was not explicitly provided in the paper (Zhang et al., 2024). By contrast, Stable-Hair v2 states that code is publicly available at https://github.com/sunkymepro/StableHairV2 (Sun et al., 10 Jul 2025). The reported resources include an SD v1.5 backbone, Bald Converter and SV3D face-finetuned checkpoints provided in the repository, and a typical workflow consisting of bald conversion followed by multi-view inference (Sun et al., 10 Jul 2025).

Within the Stable-Hair lineage, the conceptual continuity is clear. Stable-Hair v1 introduced the bald-first decomposition, the reference-driven Hair Extractor, Hair Cross-Attention, and latent-space identity control to mitigate color drift and preserve non-hair regions (Zhang et al., 2024). Stable-Hair v2 retains those core modules but adds polar-azimuth embeddings, temporal attention, a face-finetuned multi-view diffusion prior, and a staged curriculum explicitly targeted at view consistency (Sun et al., 10 Jul 2025).

From an encyclopedic perspective, Stable-Hair v2 can therefore be understood as the multi-view generalization of Stable-Hair rather than a wholly separate architecture. Its defining contribution is not merely higher fidelity on single portraits, but the introduction of a training and conditioning framework for producing consistent hairstyle transfers across multiple viewpoints suitable for digital humans and virtual avatars (Sun et al., 10 Jul 2025).

References

Sun et al. (2025). Stable-Hair v2: Real-World Hair Transfer via Multiple-View Diffusion Model.

Zhang et al. (2024). Stable-Hair: Real-World Hair Transfer via Diffusion Model.
