StableAvatar: Audio-Driven & Consistent Avatars

Updated 4 July 2026

StableAvatar is a line of avatar research that targets stable identity, temporal coherence, and multi-view consistency in synthesized avatars.
It relies on mechanisms such as time-step-aware audio adapters, dual diffusion supervision, and geometric anchoring to suppress drift and maintain smooth transitions.
Reported benchmarks on HDTF and AVSpeech show gains over prior methods in FVD, CSIM, and synchronization metrics.

StableAvatar denotes a line of avatar research in which stability is treated as a primary systems objective rather than a secondary by-product of synthesis quality. In the narrow sense, the term refers to the end-to-end video diffusion transformer introduced in “StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation,” which conditions on a reference image and audio to synthesize infinite-length high-quality videos without post-processing (Tu et al., 11 Aug 2025). In a broader technical sense, recent work uses “StableAvatar” as a design target for systems that must remain identity-consistent across views, robust under long temporal horizons, controllable under reenactment, and geometrically coherent under sparse or monocular supervision, as exemplified by Arc2Avatar, AvatarBooth, TexAvatars, Autoregressive Appearance Prediction for 3D Gaussian Avatars, JacobianAvatar, and AniArtAvatar (Gerogiannis et al., 9 Jan 2025, Zeng et al., 2023, Lee et al., 24 Dec 2025, Steiner et al., 1 Apr 2026, Won et al., 30 Jun 2026, Li, 2024).

1. Scope and technical meaning

Taken together, these works suggest that “stability” in avatar research spans at least four distinct regimes. The first is long-horizon temporal stability, where appearance, identity, color, and lip synchronization must not drift over thousands of frames. The second is multi-view identity stability, where single-image or few-image avatar reconstruction must preserve the subject across frontal, side, and back views without Janus artifacts or color shifts. The third is deformation stability, where rigged or driven avatars must remain coherent under large expressions, extreme head poses, or semi-rigid body motion. The fourth is control stability, where external conditioning signals such as audio, pose, text, or motion sequences should produce smooth, interpretable responses rather than contradictory supervision (Tu et al., 11 Aug 2025, Gerogiannis et al., 9 Jan 2025, Lee et al., 24 Dec 2025, Won et al., 30 Jun 2026).

This broader reading is supported by the failure modes emphasized across the literature. StableAvatar attributes long-video degradation to latent distribution error accumulation caused by injecting third-party audio embeddings into a diffusion backbone that lacks audio-related priors (Tu et al., 11 Aug 2025). Arc2Avatar identifies instability in score-distillation-based single-image head generation, especially color shift and view inconsistency, and addresses it with a strong identity prior, targeted initialization, and masked 3D Gaussian Splatting (Gerogiannis et al., 9 Jan 2025). AvatarBooth frames the main problem as the incompatibility between high-frequency facial identity, full-body clothing detail, and uncontrolled pose in casual photos, motivating dual diffusion heads and pose-consistent constraints (Zeng et al., 2023). TexAvatars, AAP-3DGA, and JacobianAvatar each analyze different manifestations of instability under reenactment or driving: cross-triangle discontinuities, one-to-many pose-to-appearance ambiguity, and monocular occlusion-induced deformation drift, respectively (Lee et al., 24 Dec 2025, Steiner et al., 1 Apr 2026, Won et al., 30 Jun 2026).

A distinct antecedent appears in social XR rather than generative modeling. “Moving Avatars and Agents in Social Extended Reality Environments” introduces Smart Avatars and Stuttered Locomotion to preserve continuous, intelligible user representation under noncontinuous locomotion, emphasizing bystander spatial awareness and reduced cybersickness rather than photorealistic synthesis (Freiwald et al., 2023). This suggests that avatar stability has long had both perceptual and geometric meanings.

2. Infinite-length audio-driven avatar video generation

The most specific use of the name is the video model “StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation” (Tu et al., 11 Aug 2025). The paper presents what it describes as the first end-to-end video diffusion transformer that synthesizes infinite-length, high-quality videos without post-processing. The architecture combines a Video Diffusion Transformer based on Wan2.1 I2V 1.3B, a frozen 3D VAE encoder/decoder, CLIP image embeddings for identity preservation, and Wav2Vec 2.0 audio features refined by a dedicated audio module (Tu et al., 11 Aug 2025).

StableAvatar adopts Rectified Flow and trains the denoiser to predict velocity. Its forward process is

$\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{e}, \quad t\in[0,1],\;\mathbf{e}\sim\mathcal{N}(\mathbf{0},\mathbf{I}).$
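The forward process above can be illustrated with a minimal NumPy sketch. It assumes the velocity target is the time derivative of the stated straight-line path, $\mathbf{e}-\mathbf{x}_0$; the paper's exact sign convention and loss weighting are not given here, and the function name `rectified_flow_pair` and the latent shape are hypothetical.

```python
import numpy as np

def rectified_flow_pair(x0, t, rng):
    """Return (x_t, velocity target) for the path x_t = (1 - t) * x0 + t * e."""
    e = rng.standard_normal(x0.shape)   # e ~ N(0, I)
    x_t = (1.0 - t) * x0 + t * e        # forward process from the text
    v_target = e - x0                   # d x_t / d t along the straight path (assumed target)
    return x_t, v_target

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))        # stand-in for a clean latent (hypothetical shape)
t = 0.3
x_t, v = rectified_flow_pair(x0, t, rng)

# Stepping back along the velocity by t recovers the clean latent.
x0_rec = x_t - t * v
```

Because the path is linear in $t$, a perfect velocity prediction lets a single step of length $t$ return to $\mathbf{x}_0$, which is the property the last line checks.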

The central claim of the paper is that long-horizon failure is primarily an audio-modeling problem. Existing methods inject off-the-shelf audio embeddings through cross-attention into a backbone with no joint audio prior; StableAvatar instead introduces a Time-step-aware Audio Adapter that modulates audio features with the diffusion time-step embeddings and explicitly cross-attends them to the latent state, thereby tying audio to the denoising trajectory itself (Tu et al., 11 Aug 2025). At inference, this is complemented by Audio Native Guidance, which uses the model’s own evolving joint audio–latent prediction as guidance, and by a Dynamic Weighted Sliding-window Strategy, which fuses overlapping clip latents with a logarithmic weighting kernel:

$w(\tau) = \frac{\ln\big(1 + u\cdot(\mathrm{e}-1)\big)}{\ln(\mathrm{e})}, \quad u=\frac{\tau}{m-1}.$
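A short sketch can make the weighting kernel concrete. It assumes that $\tau$ indexes positions inside the overlap and that $m$ is the overlap length, and it assumes a simple convex blend between the earlier and later clip; the paper's exact fusion rule may differ, and both function names are hypothetical.

```python
import math
import numpy as np

def dwsw_weights(m):
    """Logarithmic weights w(tau) for tau = 0..m-1, following the formula above."""
    if m < 2:
        raise ValueError("overlap length m must be at least 2")
    u = np.arange(m) / (m - 1)
    return np.log1p(u * (math.e - 1.0)) / math.log(math.e)

def fuse_overlap(prev_tail, next_head):
    """Blend two overlapping latent segments of shape (m, ...)."""
    m = prev_tail.shape[0]
    w = dwsw_weights(m).reshape((m,) + (1,) * (prev_tail.ndim - 1))
    # Assumed blending rule: weight shifts from the earlier clip to the later one.
    return (1.0 - w) * prev_tail + w * next_head

w = dwsw_weights(5)
prev_tail = np.zeros((5, 2, 2))   # toy latents for the end of the previous window
next_head = np.ones((5, 2, 2))    # toy latents for the start of the next window
fused = fuse_overlap(prev_tail, next_head)
```

The kernel starts at 0, ends at 1, and rises faster early in the overlap than a linear ramp would, so under this blend the later window takes over sooner.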

The reported quantitative results are framed around three benchmarks: HDTF, AVSpeech, and Long100. StableAvatar reports FID/FVD/CSIM/Sync-C/Sync-D/IQA/ASE of 38.14/375/0.875/8.15/6.94/3.90/2.46 on HDTF, 68.12/640/0.872/7.56/7.85/3.79/2.32 on AVSpeech, and 57.18/504/0.849/8.24/6.79/3.84/2.39 on Long100 (Tu et al., 11 Aug 2025). On Long100, it is compared directly to OmniAvatar with CSIM 0.849 vs 0.471, Sync-C 8.24 vs 4.45, FVD 504 vs 1621, and FID 57.18 vs 168.49 (Tu et al., 11 Aug 2025).

The ablation results are central because they operationalize what stability means in this setting.

Configuration	FVD	CSIM
w/o Audio Adapter	1802	0.457
w/o Guidance	866	0.822
w/o DWSW	718	0.845
Ours	504	0.849

These numbers show that the adapter is the dominant mechanism for preventing drift, guidance provides an additional improvement, and sliding-window fusion further improves smoothness (Tu et al., 11 Aug 2025). The time-local analysis is equally important: without the proposed adapter, late frames between 3500 and 3700 degrade from FVD 865 to 2388, CSIM 0.836 to 0.405, Sync-C 7.66 to 3.78, and CIEDE 0.536 to 2.318; with the adapter and guidance, late-frame metrics remain close to early-frame metrics, including FVD 478, CSIM 0.846, Sync-C 8.28, and CIEDE 0.166 (Tu et al., 11 Aug 2025). In this literature, StableAvatar therefore names a specific attempt to keep the denoising trajectory anchored to a joint audio–latent distribution over arbitrarily long horizons.
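The relative size of each ablation effect can be read off with a few lines of arithmetic. The sketch below uses only the FVD and CSIM values from the ablation table; the dictionary layout and variable names are illustrative.

```python
# (FVD, CSIM) pairs copied from the ablation table above.
ablation = {
    "w/o Audio Adapter": (1802, 0.457),
    "w/o Guidance": (866, 0.822),
    "w/o DWSW": (718, 0.845),
    "Ours": (504, 0.849),
}

full_fvd, full_csim = ablation["Ours"]

# For each ablated variant: FVD ratio to the full model and absolute CSIM drop.
summary = {
    name: (fvd / full_fvd, full_csim - csim)
    for name, (fvd, csim) in ablation.items()
    if name != "Ours"
}

for name, (fvd_ratio, csim_drop) in summary.items():
    print(f"{name}: FVD x{fvd_ratio:.2f}, CSIM drop {csim_drop:.3f}")
```

Removing the adapter raises FVD by roughly 3.6 times, while removing guidance or sliding-window fusion raises it by less than a factor of two, which matches the ordering described in the text.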

3. Single-image 3D head avatars and identity-stable reconstruction

In single-image 3D avatar generation, “StableAvatar” is used more as a systems template than as a paper title. Arc2Avatar is explicitly summarized as a practical blueprint for a StableAvatar system focused on stability, identity preservation, natural color, expressivity via blendshapes, and multi-view consistency (Gerogiannis et al., 9 Jan 2025). Its pipeline uses ArcFace-conditioned Arc2Face guidance, LoRA-based diverse-view augmentation, Interval Score Matching instead of classic SDS, and a 3D Gaussian Splatting representation anchored one-to-one to a subdivided FLAME template (Gerogiannis et al., 9 Jan 2025).

The method’s stability claim rests on three coupled decisions. First, Arc2Face supplies a strong identity prior, with identity-conditioned CLIP tokens blended with simple view tokens as $c_d = b\cdot c_{\text{default}} + (1-b)\cdot c_{\text{view}}$ using $b=0.85$, while LoRA scale $s_L=0.45$ balances diverse-view generation with identity (Gerogiannis et al., 9 Jan 2025). Second, Arc2Avatar replaces standard SDS with ISM, enabling a guidance scale of 1 together with Perp-Neg, which the paper argues avoids the common SDS oversaturation and color shift while preserving detail (Gerogiannis et al., 9 Jan 2025). Third, the face region is kept in dense correspondence with FLAME by assigning one Gaussian per vertex on a subdivided template and protecting facial splats from densification, pruning, and opacity reset (Gerogiannis et al., 9 Jan 2025).
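The token blend in the first decision is a plain convex combination, sketched below. The token shape (77, 768) and the function name `blend_condition` are assumptions chosen for illustration; only the formula and $b=0.85$ come from the text.

```python
import numpy as np

def blend_condition(c_default, c_view, b=0.85):
    """Convex blend c_d = b * c_default + (1 - b) * c_view."""
    if not 0.0 <= b <= 1.0:
        raise ValueError("b must lie in [0, 1]")
    return b * c_default + (1.0 - b) * c_view

# Hypothetical token tensors; real embeddings would come from the conditioning encoder.
c_default = np.ones((77, 768))
c_view = np.zeros((77, 768))
c_d = blend_condition(c_default, c_view)
```

With $b$ close to 1, the identity-conditioned tokens dominate and the view tokens act as a small directional nudge.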

Expression control follows the FLAME expression basis directly:

$V(e) = V_0 + B e.$

Because facial splats are template-aligned, expressions are transferred by moving Gaussian means along the per-vertex displacements of the driven mesh, and an optional short ISM correction run can refine difficult mouth interiors such as teeth and tongue (Gerogiannis et al., 9 Jan 2025). The implementation details are unusually specific: 512×512 renders, 6000 iterations, batch size 4, a single RTX 4090 (24GB), roughly 80 minutes per avatar, and about 110K Gaussians in the final model (Gerogiannis et al., 9 Jan 2025).
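The expression transfer described above can be sketched with random stand-in data. The array shapes, the random basis, and both function names are hypothetical; a real FLAME model would supply $V_0$ and $B$, and the optional ISM correction run is not modeled.

```python
import numpy as np

def flame_vertices(v0, basis, e):
    """V(e) = V0 + B e, with v0 of shape (N, 3), basis (N, 3, K), e (K,)."""
    return v0 + basis @ e

def transfer_expression(means_neutral, v0, basis, e):
    """Move one-Gaussian-per-vertex means by the per-vertex mesh displacement."""
    displacement = flame_vertices(v0, basis, e) - v0
    return means_neutral + displacement

rng = np.random.default_rng(0)
n_vertices, n_expr = 6, 3
v0 = rng.standard_normal((n_vertices, 3))                  # stand-in neutral template
basis = rng.standard_normal((n_vertices, 3, n_expr))       # stand-in expression basis
means = v0 + 0.01 * rng.standard_normal((n_vertices, 3))   # optimized Gaussian means near the template
e = np.array([0.5, 0.0, -0.2])                             # expression coefficients

moved = transfer_expression(means, v0, basis, e)
```

Because the displacement is added to the optimized means rather than replacing them, any offset a Gaussian learned relative to its vertex is preserved under expression changes.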

The reported performance is used to support the broader StableAvatar design claim. Arc2Avatar reports FID 144.58 versus HumanNorm 173.02, ID-to-3D 154.51, Magic123 159.21, DreamCraft3D 186.98, TADA 213.39, DreamFace 214.58, and Fantasia3D 280.32, together with a 93% user preference over ID-to-3D, Magic123, and DreamCraft3D (Gerogiannis et al., 9 Jan 2025). The paper also states that Identity Similarity Distribution shows the highest mean and lowest variance across views among SDS baselines (Gerogiannis et al., 9 Jan 2025). In this setting, stability means that low guidance can be used without losing detail, provided that identity priors and initialization are sufficiently strong.

AniArtAvatar addresses a different corner of the same problem space. It constructs an animatable 3D-aware art avatar from a single neutral portrait by using Wonder3D to synthesize six views with colors and normals at 256×256, reconstructing a static NeuS-like SDF avatar, then driving expression through 2D landmark detection, 3D lifting onto the implicit surface, and cage-based head and torso control (Li, 2024). Its quantitative evaluation on a Disney-style cartoon dataset reports FID 23.756 and CPBD 0.1207, with failure modes including extreme profile inconsistencies, hair warps, and poor open-mouth interiors (Li, 2024). This suggests that single-image StableAvatar design extends beyond photorealistic heads to stylized avatars, but with different assumptions about landmarks and geometry.

4. Full-body generation, personalization, and editability

AvatarBooth is one of the clearest full-body precursors to the broader StableAvatar concept (Zeng et al., 2023). It generates high-quality 3D human avatars from text prompts or specific images and addresses stability by separating facial identity from body appearance. The representation is NeuS, initialized from an SMPL shape, and optimization uses SDS over three rendered modalities—color, texture-less geometry shading, and normals—sampled with ratio 1:1:8 (Zeng et al., 2023).

The central architectural move is the use of dual fine-tuned diffusion models. A body model, $D_{\text{body}}$, is DreamBooth-fine-tuned on full-body photos; a face model, $D_{\text{face}}$, is trained in two stages, first for 900 iterations to generate multi-view head images with ControlNet (OpenPose), then for 500 iterations on the union of real headshots and the generated multi-view images (Zeng et al., 2023). During SDS optimization, face-centered renders use the face model and body-centered renders use the body model, with a camera sampling ratio of 25% face-centered and 75% body-centered (Zeng et al., 2023). This division of labor is combined with a multi-resolution supervision schedule that upsamples renders to 512, 640, and 768 over 2000, 2000, and 4000 steps, for a total of 8000 optimization steps (Zeng et al., 2023).
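The camera split and resolution schedule can be written down as a small scheduling sketch. The function names and the string labels "face" and "body" are hypothetical; only the 25%/75% ratio and the 512/640/768 schedule over 2000/2000/4000 steps come from the text, and no actual diffusion guidance is performed.

```python
import random

# (number of steps, supervision resolution) stages from the text.
RESOLUTION_SCHEDULE = [(2000, 512), (2000, 640), (4000, 768)]

def resolution_at(step):
    """Return the supervision resolution for a 0-indexed optimization step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    start = 0
    for n_steps, resolution in RESOLUTION_SCHEDULE:
        if step < start + n_steps:
            return resolution
        start += n_steps
    raise ValueError("step is beyond the 8000-step schedule")

def pick_guidance_model(rng, face_prob=0.25):
    """Choose which fine-tuned model supervises the next render."""
    return "face" if rng.random() < face_prob else "body"

rng = random.Random(0)
picks = [pick_guidance_model(rng) for _ in range(8000)]
face_fraction = picks.count("face") / len(picks)
total_steps = sum(n for n, _ in RESOLUTION_SCHEDULE)
```

Sampling the camera type per step, rather than alternating deterministically, is one simple way to realize the stated ratio; the paper may implement it differently.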

AvatarBooth is also one of the first systems in this set to emphasize editability as part of stability. The resulting avatar can be edited with prompts such as “[V] man with yellow hair,” “[V] wearing like a wizard,” or “[V] in his fifties,” and driven by motion sequences through an SMPL skeleton and linear blend skinning (Zeng et al., 2023). The paper reports that, in a user study with 30 volunteers and 10 avatars per method, AvatarBooth achieved the highest average scores for correspondence with text, appearance quality, geometry quality, and face fidelity relative to CLIP-Actor, AvatarCLIP, and TEXTure, although exact numbers are not provided (Zeng et al., 2023).

This line of work establishes a recurring definition of StableAvatar in the full-body setting: a personalized model must tolerate uncontrolled capture conditions, maintain face fidelity without sacrificing clothing detail, and remain editable and riggable. Multi-resolution SDS, pose-consistent ControlNet augmentation, and the separation of face and body supervision are the mechanisms used to make that definition operational (Zeng et al., 2023).

5. Stable driving, reenactment, and deformation control

Once avatar geometry has been reconstructed, stability becomes a question of how that representation is driven. Three recent systems give particularly sharp answers.

TexAvatars introduces a hybrid texel–3D representation for photorealistic Gaussian head avatars under extreme reenactment (Lee et al., 24 Dec 2025). It predicts local Gaussian attributes in UV space with CNNs, but lifts them into global 3D using mesh-aware Jacobians derived from a tracked FLAME mesh. The analytic rigging equations are

$\Sigma = J\Sigma^\ell J^\top, \qquad \mu = J\mu^\ell + T.$

Its distinctive contribution is the Quasi-Phong Jacobian Field, in which triangle-wise Jacobians are unwrapped into UV space and then bilinearly resampled with align_corners=False to remove piecewise discontinuities across triangle boundaries (Lee et al., 24 Dec 2025). On NeRSemble, the method reports Novel Expression (Held-out) LPIPS 0.048 ± 0.013, SSIM 0.894 ± 0.030, PSNR 25.61 ± 2.10; Novel Expression (FREE) LPIPS 0.077 ± 0.017, SSIM 0.861 ± 0.033, PSNR 22.84 ± 2.05; and Novel View LPIPS 0.030 ± 0.005, SSIM 0.947 ± 0.013, PSNR 35.15 ± 1.33 (Lee et al., 24 Dec 2025). Here stability is explicitly tied to bounded gradient propagation, continuous Jacobian interpolation, and resistance to out-of-distribution expressions and poses.
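The rigging equations can be checked numerically on a single Gaussian. The sketch below uses a hand-picked Jacobian and translation as stand-ins for values derived from a tracked mesh; the function name `rig_gaussian` is hypothetical, and the UV-space resampling of Jacobians is not modeled.

```python
import numpy as np

def rig_gaussian(mu_local, sigma_local, J, T):
    """Lift a local Gaussian to global 3D: mu = J mu_l + T, Sigma = J Sigma_l J^T."""
    mu = J @ mu_local + T
    sigma = J @ sigma_local @ J.T
    return mu, sigma

# Stand-in Jacobian: stretch x by 2, then rotate 90 degrees about the x-axis.
J = np.array([[2.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
T = np.array([0.5, 0.0, 0.0])
mu_local = np.array([1.0, 2.0, 3.0])
sigma_local = np.diag([0.01, 0.04, 0.09])

mu, sigma = rig_gaussian(mu_local, sigma_local, J, T)
```

The congruence transform $J\Sigma^\ell J^\top$ keeps the covariance symmetric and, for an invertible $J$, positive definite, so the lifted Gaussian remains valid for splatting.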

Autoregressive Appearance Prediction for 3D Gaussian Avatars addresses a different instability: abrupt appearance changes caused by ambiguous pose-to-appearance mappings in long-form captures (Steiner et al., 1 Apr 2026). The method learns local appearance latents from per-frame UV textures using a VAE-style encoder and then predicts them autoregressively at driving time with a causal transformer that takes previous latents together with pose, velocity, and acceleration histories (Steiner et al., 1 Apr 2026). The reported mean test performance is 32.99 PSNR, 0.939 SSIM, and 0.063 LPIPS, compared with 32.34/0.937/0.066 for MMLPs† and 32.74/0.939/0.065 for nRFGCA†; predictor re-initialization every 30 frames improves test PSNR to 33.85, while using the encoder at test gives an upper bound of 34.78 (Steiner et al., 1 Apr 2026). In this formulation, stability is temporal smoothness in the learned appearance manifold rather than rigidity in geometry alone.
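A toy rollout can illustrate the causal structure of this prediction. The sketch assumes finite-difference velocity and acceleration features and replaces the causal transformer with a hypothetical one-step update `toy_predict`; it also conditions only on the previous latent and the current motion features, which is a simplification of the history-based model described above.

```python
import numpy as np

def motion_features(poses):
    """Concatenate pose, velocity, and acceleration per frame.

    Velocity and acceleration are backward finite differences; the first
    frame has no predecessor, so its differences are zero.
    """
    vel = np.diff(poses, axis=0, prepend=poses[:1])
    acc = np.diff(vel, axis=0, prepend=vel[:1])
    return np.concatenate([poses, vel, acc], axis=-1)

def rollout(predict, feats, latent_dim):
    """Causal rollout: each latent depends only on the past latent and current motion."""
    z = np.zeros(latent_dim)   # start from a zero latent
    latents = []
    for f in feats:
        z = predict(z, f)
        latents.append(z)
    return np.stack(latents)

rng = np.random.default_rng(0)
pose_dim, latent_dim, n_frames = 4, 8, 60
W = 0.1 * rng.standard_normal((latent_dim, 3 * pose_dim))

def toy_predict(z, f):
    """Hypothetical stand-in for the learned predictor (not the paper's model)."""
    return 0.9 * z + 0.1 * np.tanh(W @ f)

# Constant-velocity toy motion for every pose channel.
poses = np.linspace(0.0, 1.0, n_frames)[:, None] * np.ones(pose_dim)
feats = motion_features(poses)
latents = rollout(toy_predict, feats, latent_dim)
```

Feeding the previous latent back into the predictor is what bounds frame-to-frame change in this toy version, which mirrors the smoothness argument in the text without reproducing the reported numbers.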

JacobianAvatar treats stability as integrability and temporal coherence in semi-rigid deformation from monocular video (Won et al., 30 Jun 2026). It predicts pose-conditioned neural Jacobian fields on a canonical mesh, integrates them with a screened Poisson solver, regularizes the result with SDF-based normal alignment, and adds a deformation-guided residual flow loss to tie 3D motion to 2D correspondences across adjacent frames (Won et al., 30 Jun 2026). The screened Poisson problem is formulated as

$E(\Phi) = \int_\Omega \|\nabla\Phi(x)-J(x)\|^2\,dx + \lambda_\Phi \int_\Omega \|\Phi(x)-x\|^2\,dx,$

yielding a soft anchor toward the canonical shape in poorly observed regions (Won et al., 30 Jun 2026). The method reports 31.83 PSNR, 0.978 SSIM, and 1.67 LPIPS on MonoPerfCap; 29.90 PSNR, 0.971 SSIM, and 2.19 LPIPS on DNA-Rendering; and geometry metrics of CD 2.46, NE 0.091, F1@1cm 0.397, and F1@2cm 0.681 on SynWild (Won et al., 30 Jun 2026). Its ablations show that removing the SDF-based normal alignment, residual flow, or screened Poisson degrades CD and NE and produces severe noise or stretching (Won et al., 30 Jun 2026).
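A one-dimensional discretization shows how the screening term in the energy above behaves. The sketch assumes a uniform grid, forward differences for $\nabla\Phi$, and a direct solve of the normal equations; the function name `screened_poisson_1d` is hypothetical, and the actual method operates on a mesh rather than a 1D grid.

```python
import numpy as np

def screened_poisson_1d(x, J, lam):
    """Minimize sum (D phi - J)^2 + lam * sum (phi - x)^2 on a uniform 1D grid.

    x   : (n,) canonical positions
    J   : (n-1,) target derivative on each grid interval
    lam : screening weight pulling phi toward the canonical positions
    """
    n = x.size
    h = x[1] - x[0]
    D = (np.eye(n, k=1) - np.eye(n))[:-1] / h   # forward-difference operator, shape (n-1, n)
    A = D.T @ D + lam * np.eye(n)               # normal equations of the quadratic energy
    b = D.T @ J + lam * x
    return np.linalg.solve(A, b)

x = np.linspace(0.0, 1.0, 11)
h = x[1] - x[0]

phi_identity = screened_poisson_1d(x, np.ones(10), lam=1.0)     # J = 1 reproduces the canonical shape
phi_weak = screened_poisson_1d(x, 2.0 * np.ones(10), lam=1e-4)  # weak screening follows the Jacobian
phi_strong = screened_poisson_1d(x, 2.0 * np.ones(10), lam=1e6) # strong screening stays near canonical
```

With a small screening weight the solution follows the prescribed Jacobian, and with a large one it stays close to the canonical positions, which is the soft-anchor behavior the text describes for poorly observed regions.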

A concise comparison of these stability mechanisms is useful:

System	Representation	Primary stability mechanism
StableAvatar	Video DiT in latent space	Time-step-aware Audio Adapter, Audio Native Guidance, DWSW
Arc2Avatar	FLAME-anchored 3DGS	ArcFace-conditioned diffusion prior, ISM, masked facial splats
AvatarBooth	NeuS + SMPL	Dual diffusion heads, pose-consistent ControlNet, multi-resolution SDS
TexAvatars	Hybrid texel–3DGS	Mesh-aware Jacobians, Quasi-Phong Jacobian Field
AAP-3DGA	Spatial-MLP 3DGS	Autoregressive local appearance latent prediction
JacobianAvatar	Mesh + mesh-anchored 3DGS	Neural Jacobian fields, screened Poisson, residual flow

This comparison suggests that StableAvatar, as a research concept, is not tied to a single representation. The common structure is the insertion of explicit mechanisms that suppress drift: conditioning-aware modules in diffusion, template- or mesh-aware geometric anchoring in 3D, and causal or residual-motion modeling for temporal continuity.

6. Limitations and open directions

The literature is consistent in showing that stability remains conditional rather than absolute. StableAvatar itself reports failure on non-human references with extreme morphology, degradation under noisy or misaligned audio, and nontrivial long-horizon compute costs, despite being 10× faster and using about 50% of the memory of OmniAvatar in a given setting (Tu et al., 11 Aug 2025). Arc2Avatar reports occasional unintended expressions during neutral-phase optimization, ear artifacts, edge blurring, difficulty with hair, accessories, and extreme occlusions, and rare identity drifts when view or LoRA weights are mis-set (Gerogiannis et al., 9 Jan 2025). AvatarBooth identifies identity leakage under extremely sparse or occluded personalization sets, imperfect hair and accessories, artifacts with loose or non-manifold clothing, and instability in high-resolution SDS without the proposed schedule (Zeng et al., 2023).

The reenactment literature adds another layer of limitations. TexAvatars does not fully model dynamic hair strands, fully articulated tongue motion, or explicit specular glints, and its fixed Gaussian count can miss micro-detail such as pores or facial hair (Lee et al., 24 Dec 2025). AAP-3DGA notes that stronger pose–appearance disentanglement worsened reconstruction quality, that strict locality misses cast shadows and other long-range interactions, and that zero-latent bootstrap imposes a short warm-up ramp at test time (Steiner et al., 1 Apr 2026). JacobianAvatar remains limited by inherited naked-body skinning weights for loose garments, does not model facial expressions or hands in its presented form, and can still degrade under very fast motion or extreme occlusion (Won et al., 30 Jun 2026).

A plausible implication is that future StableAvatar systems will continue to hybridize mechanisms rather than converge on a single canonical architecture. The evidence across these papers favors designs that combine strong conditioning priors, explicit geometric structure, and dedicated temporal controls: ArcFace-conditioned diffusion with masked template-aligned 3DGS (Gerogiannis et al., 9 Jan 2025); dual face/body diffusion supervision with pose-consistent ControlNet (Zeng et al., 2023); Jacobian-based deformation fields and screened integration (Won et al., 30 Jun 2026); and autoregressive latent dynamics for appearance smoothness (Steiner et al., 1 Apr 2026). The term “StableAvatar” therefore increasingly denotes a systems criterion—stable identity, stable color, stable deformation, and stable temporal behavior—rather than a single model family.