
JoyAvatar: Real-time Infinite-Length Avatar Synthesis

Updated 19 December 2025
  • JoyAvatar is an audio-driven autoregressive generative model that produces infinite-length avatar videos in real time using a block-wise diffusion Transformer framework.
  • It incorporates Progressive Step Bootstrapping and Motion Condition Injection to stabilize early generation and enhance temporal coherence, reducing error accumulation and block-boundary artifacts.
  • Enhanced by Unbounded Rotary Position Embeddings via Cache-Resetting and optimized with audio and identity embeddings, JoyAvatar outperforms previous methods in lip-sync accuracy, temporal consistency, and visual quality.

JoyAvatar is an audio-driven autoregressive generative model for real-time and infinite-length avatar video synthesis using block-wise diffusion Transformers. It is designed to address computational and quality limitations of prior diffusion Transformer (DiT) methods and autoregressive approaches in audio-driven avatar generation. Key innovations include Progressive Step Bootstrapping (PSB) for improved stability, Motion Condition Injection (MCI) for enhanced temporal coherence, and Unbounded Rotary Position Embeddings via Cache-Resetting (URCR) enabling video generation of arbitrary duration. The architecture is optimized for high performance, achieving real-time synthesis speeds, and is benchmarked on custom out-of-distribution (OOD) test sets for metrics such as lip-sync, temporal consistency, and visual fidelity (Li et al., 12 Dec 2025).

1. Block-wise Autoregressive Diffusion Framework

JoyAvatar generates video as a sequence of fixed-length, non-overlapping blocks of latent frames, with each block autoregressively conditioned on all prior blocks and the audio input. Given latent frames $x^1, \ldots, x^N$, the model factorizes the video likelihood as:

p(x^{1:N}) = \prod_{i} p(x^i \mid x^{1:i-1}, a)

where $a$ represents the audio driving features. For each block, a few-step denoising DiT instantiates $p(x^i \mid \cdot)$:

  • The block’s latents are initially corrupted with Gaussian noise at the maximum noise level.
  • $T_i$ denoising steps are performed to recover the clean data, conditioning on:
    • KV (Key/Value) cache of prior block representations (including motion conditions, see MCI).
    • Cross-attention to projected Wav2Vec audio embeddings.
    • Cross-attention to static ArcFace-based identity embeddings.
  • After denoising, the generated block is appended to the cache, and the cache is truncated to a sliding window (e.g., 4 blocks) to bound computation. Generation continues sequentially in this manner, supporting infinite-length decoding (Li et al., 12 Dec 2025).
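
To make the control flow concrete, here is a minimal runnable sketch of the block-wise loop with a sliding-window KV cache. The DiT, VAE, and feature encoders are replaced by trivial stand-ins; the function names, shapes, and constants are illustrative assumptions, not the released JoyAvatar API.

import numpy as np

WINDOW_BLOCKS = 4      # cache keeps only the most recent blocks
BLOCK_LATENTS = 3      # latent frames per block
LATENT_DIM = 16        # toy latent size for the stand-in model

def dit_denoise_stub(block, t, cache, audio, identity):
    """Stand-in for one DiT denoising update (not the real model)."""
    context = np.mean(np.stack(cache), axis=(0, 1)) if cache else 0.0
    return 0.9 * block + 0.1 * (context + audio + identity)

def generate_blocks(audio_embeds, id_embed, steps_for_block):
    kv_cache = []                                             # prior clean blocks
    for i, audio in enumerate(audio_embeds):
        block = np.random.randn(BLOCK_LATENTS, LATENT_DIM)    # start at maximum noise
        for t in np.linspace(1000, 250, steps_for_block(i)):  # few-step denoising
            block = dit_denoise_stub(block, t, kv_cache, audio, id_embed)
        kv_cache.append(block)
        kv_cache = kv_cache[-WINDOW_BLOCKS:]                  # truncate to the sliding window
        yield block                                           # a real system would VAE-decode and stream here

# Toy driving features: 8 audio blocks and one identity embedding.
audio = [np.random.randn(LATENT_DIM) for _ in range(8)]
frames = list(generate_blocks(audio, np.random.randn(LATENT_DIM),
                              steps_for_block=lambda i: 8 - i if i < 4 else 4))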

2. Progressive Step Bootstrapping (PSB)

PSB addresses error accumulation during autoregressive roll-outs by allocating more denoising steps to the initial blocks, which serve as critical context. Training is performed with a window of 4 blocks:

  • Main denoising steps: noise timesteps at [1000, 750, 500, 250].
  • An auxiliary “sub-step” branch (midpoints [875, 625, 375, 125]) is stochastically activated during training, supplying denser learning signals for early blocks via optional extra supervision.
  • During inference, the number of denoising steps $T_i$ per block is {8, 7, 6, 5} for blocks 1–4, reverting to the baseline 4 steps for $i > 4$.

Pseudocode (inference schedule):

for i in range(1, block_count + 1):
    # PSB: extra denoising steps for the first four blocks
    Ti = 9 - i if i <= 4 else 4                 # yields 8, 7, 6, 5, then 4
    noisy_block = sample_noise()                # start from the maximum noise level
    for t in denoise_schedule(Ti):              # Ti timesteps, from max noise downward
        noisy_block = DiT_denoise(noisy_block, t,
                                  context=KV_cache,
                                  audio=audio_embed,
                                  identity=id_embed)
    append_block_to_cache(noisy_block)

PSB results in stabilized early generation, reducing the compounding of noise and enhancing output consistency (Li et al., 12 Dec 2025).
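
On the training side, a minimal sketch of how the main timesteps and the auxiliary sub-step midpoints might be combined; the activation probability and the interleaving order are assumptions for illustration, not values reported in the paper.

import random

MAIN_STEPS = [1000, 750, 500, 250]     # main denoising timesteps
SUB_STEPS = [875, 625, 375, 125]       # auxiliary midpoints of the main intervals
SUBSTEP_PROB = 0.5                     # assumed activation probability (not reported)

def training_timesteps():
    """Timesteps supervised in one training window; the sub-step branch is activated stochastically."""
    if random.random() < SUBSTEP_PROB:
        # assumed interleaving: denser supervision when the sub-step branch is active
        return sorted(MAIN_STEPS + SUB_STEPS, reverse=True)
    return list(MAIN_STEPS)

print(training_timesteps())   # e.g. [1000, 875, 750, 625, 500, 375, 250, 125]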

3. Motion Condition Injection (MCI)

MCI is designed to mitigate block-boundary flicker artifacts and improve temporal coherence. For each block $i$:

  • The last clean latent $y_{i-1}$ from the preceding block is used to construct a motion condition frame at the start of block $i$ denoising, for noise timestep $t_1$:

c = \sqrt{\bar{\alpha}_{t_1}}\, y_{i-1} + \sqrt{1 - \bar{\alpha}_{t_1}}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

where $\bar{\alpha}_{t}$ is the cumulative noise factor.

  • This noisy frame cc is prepended to the KV-cache, enabling the model to attend to both the actual historical latent frames and this noise-corrupted condition.
  • During training, MCI is randomly applied with probability $p = 0.7$; at inference, it is always applied (Li et al., 12 Dec 2025).
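
A minimal sketch of this construction, assuming a toy cumulative-noise schedule; `alpha_bar`, the helper names, and the latent shapes are illustrative stand-ins for the actual diffusion schedule and implementation.

import numpy as np

def alpha_bar(t, t_max=1000):
    """Toy cumulative noise factor: near 1 at t=0, near 0 at t=t_max (illustrative only)."""
    return float(np.cos(0.5 * np.pi * t / t_max) ** 2)

def motion_condition(prev_clean_latent, t1, train=False, p=0.7, rng=np.random):
    """Noise the last clean latent of block i-1 to the first denoising timestep t1 of block i."""
    if train and rng.random() > p:
        return None                        # during training, MCI is applied with probability 0.7
    ab = alpha_bar(t1)
    eps = rng.standard_normal(prev_clean_latent.shape)
    return np.sqrt(ab) * prev_clean_latent + np.sqrt(1.0 - ab) * eps

# The resulting frame c is prepended to the KV cache, so the current block attends
# to it alongside the genuine historical latents.
prev_latent = np.random.randn(16)
c = motion_condition(prev_latent, t1=750)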

4. Unbounded RoPE via Cache-Resetting (URCR)

Standard Rotary Position Embeddings (RoPE) are limited by fixed positional index capacity, leading to overflow or degradation beyond a sequence length threshold (e.g., 1024 tokens). URCR resolves this, supporting infinite-length video synthesis by:

  • “Sink frames” (the first block) anchor the global reference via relative embeddings.
  • KV states are cached before RoPE application; at each attention invocation, RoPE is dynamically re-applied using the true current block position.
  • When the block index $I > N$ (e.g., $N = 27$), sink frame indices are reset to zero and subsequent relative positions are renumbered so that no index ever exceeds $N$.

This dynamic resetting maintains all positional indices within a bounded window, preventing index overflow and preserving model performance for arbitrarily long sequences (Li et al., 12 Dec 2025).
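
A minimal sketch of the index bookkeeping, assuming 3 sink latents, 3 latents per block, a 4-block history window, and $N = 27$; the exact arithmetic is an illustrative reading of the description above, not the released implementation.

SINK_LATENTS = 3         # latents of the first ("sink") block, kept as a global anchor
LATENTS_PER_BLOCK = 3
WINDOW_BLOCKS = 4        # sliding window of cached history blocks
MAX_INDEX = 27           # positional capacity N seen during training

def rope_indices(block_idx):
    """Positions for [sink | windowed history | current block] when generating block `block_idx` (> 1)."""
    assert block_idx > 1, "block 1 produces the sink frames themselves"
    history_blocks = min(block_idx - 2, WINDOW_BLOCKS)            # cached blocks besides the sink
    sink = list(range(SINK_LATENTS))                              # sink indices reset to 0, 1, 2
    n_hist = history_blocks * LATENTS_PER_BLOCK
    history = list(range(SINK_LATENTS, SINK_LATENTS + n_hist))    # renumbered relative positions
    current = list(range(SINK_LATENTS + n_hist,
                         SINK_LATENTS + n_hist + LATENTS_PER_BLOCK))
    assert max(current) < MAX_INDEX                               # indices never overflow N
    return sink, history, current

# Block 500 of an arbitrarily long roll-out still uses indices below 27:
print(rope_indices(500))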

5. Model Implementation and Quantitative Performance

Major implementation parameters include:

  • Model: Causal student, 1.3B parameters; audio-driven bidirectional teacher, 14B parameters.
  • Input preprocessing uses Wav2Vec 1.0 for audio features and ArcFace for identity embeddings, with both projected into the DiT via cross-attention.
  • Latent frame organization: 3 per block, 4 blocks in window (total up to 27 latents at training, with 3 sink frames).
  • Diffusion schedule: main timesteps 1000, 750, 500, 250 with auxiliary sub-steps 875, 625, 375, 125; AdamW optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$); learning rates $2 \times 10^{-6}$ (generator) and $4 \times 10^{-7}$ (fake score).
  • Distributed training: 64 GPUs, window size 4, 10k steps, batch size 1 per GPU.
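
For reference, a minimal sketch that collects the reported hyperparameters into one config object; the dataclass and its field names are illustrative, not taken from the released code.

from dataclasses import dataclass

@dataclass(frozen=True)
class JoyAvatarTrainConfig:
    student_params: str = "1.3B"               # causal student DiT
    teacher_params: str = "14B"                # bidirectional audio-driven teacher
    latents_per_block: int = 3
    window_blocks: int = 4
    main_timesteps: tuple = (1000, 750, 500, 250)
    sub_timesteps: tuple = (875, 625, 375, 125)
    optimizer: str = "AdamW"
    betas: tuple = (0.9, 0.999)
    lr_generator: float = 2e-6
    lr_fake_score: float = 4e-7
    mci_train_prob: float = 0.7
    gpus: int = 64
    train_steps: int = 10_000
    batch_size_per_gpu: int = 1

cfg = JoyAvatarTrainConfig()
print(cfg.optimizer, cfg.betas, cfg.lr_generator)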

Quantitative results on a custom OOD test set (100 avatars):

| Metric | Value | Interpretation |
|---|---|---|
| Sync-C (↑) | 5.25 | Lip sync (higher is better) |
| Sync-D (↓) | 8.30 | Lip sync deviation (lower is better) |
| IDC (↑) | 0.72 | Temporal consistency |
| Q_score (↑) | 0.84 | Visual quality |
| Q_face (↑) | 0.53 | Facial fidelity |
| Inference FPS (DiT) | 16.2 | Real-time speed |
| Inference FPS (E2E) | >30 (multi-GPU) | Streaming speed |

JoyAvatar outperforms causal DiT models such as LiveAvatar 14B (4.3 FPS) on lip-sync and temporal metrics while operating approximately 4× faster (Li et al., 12 Dec 2025).

6. Known Limitations and Prospective Developments

  • The 1.3B parameter causal student model is capacity-limited for scenes with extreme facial details or complexity, notably in close-up generation.
  • Extended roll-outs (many minutes) reveal gradual drift in facial appearance, and highly expressive avatars may amplify these effects.
  • Rapid head movements can occasionally produce localized motion blur artifacts.
  • Ongoing research includes scaling the generator to 5–10B parameters for fine-grained detail, refining controls for semantic attributes (e.g., emotion, gaze), and adopting adaptive denoising step allocations based on online error tracking (Li et al., 12 Dec 2025).
References

  • Li et al. (12 Dec 2025). JoyAvatar: Real-time Infinite-Length Avatar Synthesis.
