
DyStream: Real-Time Dyadic Talking-Head Synthesis

Updated 3 January 2026
  • DyStream is a streaming dyadic talking-head system that employs flow-matching autoregression and causal audio-to-motion modeling for real-time interactive visuals.
  • It achieves state-of-the-art lip sync and non-verbal realism with under 100 ms latency through a dual-stage architecture and optimized audio lookahead.
  • The system runs at real-time frame rates and improves substantially over traditional chunk-based methods in responsiveness, enabling natural turn-taking in conversational agents.

DyStream is a streaming dyadic talking-head video generation system designed for ultra-low latency real-time applications. It leverages a flow matching-based autoregressive architecture with causal audio-to-motion modeling, enabling dynamic video synthesis driven by both speaker and listener audio streams. DyStream achieves state-of-the-art lip-sync and dyadic non-verbal realism at under 100 ms end-to-end latency, facilitating interactive agents that require immediate visual feedback and turn-taking behaviors (Chen et al., 30 Dec 2025).

1. Motivation and Problem Setting

Realistic conversational agents require immediate non-verbal visual feedback, such as head gestures, nods, and facial expressions, from both speaker and listener. Traditional chunk-based talking-head generators (e.g., VASA, INFP) introduce substantial delays because they require an entire audio context window (e.g., 0.96 s) before video synthesis. This non-causal processing imposes a fundamental buffering delay of hundreds of milliseconds, undermining natural conversational flow and making timely listener responses impossible. Purely causal frame-wise models have the opposite limitation: they cannot capture anticipatory co-articulation, since lip shapes and facial gestures often begin to form several tens of milliseconds before the associated phoneme. DyStream addresses these deficiencies by enabling true streaming video generation with causal audio encoding plus a limited lookahead.

2. System Architecture

DyStream is organized in two distinct stages:

  • Stage 1: Motion-Aware Autoencoder
    • An image VAE (adapted from LIA/LivePortrait) disentangles identity-preserving appearance and motion.
    • Reference image $I_s$ yields an appearance code $\mathbf{v}_{\mathrm{app}} = \mathcal{E}_{\mathrm{app}}(I_s)$ and an initial motion latent $\mathbf{m}_s = \mathcal{E}_m(I_s)$.
    • During training, driving video frames yield motion latents $\mathbf{m}^{1:N}_{\mathrm{dri}} = \mathcal{E}_m(V_{\mathrm{dri}})$.
    • A flow estimator $\mathcal{F}$ predicts dense flows $\mathbf{F}_{s\rightarrow d}$, warping $\mathbf{v}_{\mathrm{app}}$ for video reconstruction via the decoder $\mathcal{D}_{\mathrm{vae}}$.
  • Stage 2: Streaming Audio-to-Motion via Flow-Matching Autoregression
    • Real-time inference uses two continuous audio tracks (speaker and listener), processed by a custom causal Wav2Vec2 encoder enhanced with Rotary Positional Embeddings and a controlled lookahead attention mask of up to 60 ms (see the mask sketch after this list).
    • Frame-wise interpolated audio features condition an autoregressive Transformer (12 causal attention + MLP blocks).
    • Audio-conditioned features $\mathbf{c}_t$ are injected into a flow-matching head (six MLP blocks) via AdaLN normalization, defining the probability distribution over future motion latents.
    • The sampled motion latent $\mathbf{m}_t$ is used to warp the appearance code for frame synthesis and provides context for subsequent steps via a sliding window with anchor conditioning, mitigating pose drift.
    • Listener audio encoding uses a purely causal mask (0 ms lookahead), in keeping with natural human responsiveness.
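
The lookahead control above amounts to a band-limited causal attention mask over audio frames. The following is a minimal sketch, not the authors' code: the ~20 ms frame stride (so 60 ms ≈ 3 frames), the helper name, and the use of PyTorch scaled-dot-product attention are illustrative assumptions.

```python
# Minimal sketch: band-limited causal attention mask giving a fixed audio lookahead.
import torch

def lookahead_mask(seq_len: int, lookahead_frames: int) -> torch.Tensor:
    """Boolean mask where True marks the key positions each query may attend to.

    Query i sees all past/current frames and at most `lookahead_frames` future
    frames; lookahead_frames=0 recovers a purely causal mask.
    """
    q = torch.arange(seq_len).unsqueeze(1)   # query index, shape (T, 1)
    k = torch.arange(seq_len).unsqueeze(0)   # key index,   shape (1, T)
    return k <= q + lookahead_frames         # (T, T) band-limited causal mask

# With a ~20 ms audio frame stride, a 60 ms lookahead corresponds to 3 frames.
speaker_mask = lookahead_mask(seq_len=50, lookahead_frames=3)   # ~60 ms lookahead
listener_mask = lookahead_mask(seq_len=50, lookahead_frames=0)  # strictly causal

# The mask can then be passed to attention, e.g.:
# torch.nn.functional.scaled_dot_product_attention(q_, k_, v_, attn_mask=speaker_mask)
```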

3. Mathematical Underpinnings

  • Autoregressive Audio-to-Motion Modeling
    • At each time step $t$, the audio-conditioned history is represented by

    $$\mathbf{c}_t = \mathrm{AR}_\theta(\mathbf{m}_{<t},\, \mathbf{a}_{\le t})$$

    where $\mathbf{a}_t$ is the dyadic audio feature, and $\mathbf{m}_t$ is a sample from the conditional density

    $$p_\theta(\mathbf{m}_t \mid \mathbf{m}_{<t},\, \mathbf{a}_{\le t})$$
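
    A routine consequence of this frame-wise design, stated here only to make the streaming factorization explicit (notation as above, with $\mathbf{a}_{\le t}$ denoting the encoded dyadic audio available at step $t$, including the small lookahead), is that the joint distribution over a window of $N$ motion latents factorizes as

    $$p_\theta(\mathbf{m}_{1:N} \mid \mathbf{a}_{1:N}) = \prod_{t=1}^{N} p_\theta(\mathbf{m}_t \mid \mathbf{m}_{<t},\, \mathbf{a}_{\le t})$$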

  • Flow Matching (Stochastic Modeling)

    • A noisy motion latent $\mathbf{m}_\tau$ is constructed at a random continuous flow timestep $\tau \sim \mathcal{U}[0, 1]$ (distinct from the frame index $t$):

    $$\mathbf{m}_\tau = (1-\sigma_\tau)\,\mathbf{m}_0 + \sigma_\tau\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)$$

    • The denoiser $\mathcal{D}_\theta$ is trained to minimize the reconstruction loss

    $$\mathcal{L}_{\mathrm{flow}} = \mathbb{E}_{\tau,\,\boldsymbol{\epsilon}}\,\big\|\mathcal{D}_\theta(\mathbf{m}_\tau,\,\tau,\,\mathbf{c}_t) - \mathbf{m}_0\big\|^2$$

    • Inference integrates the learned ODE with a 5-step Euler solver (see the code sketch after this block):

    $$\frac{d\mathbf{m}(\tau)}{d\tau} = v_\theta(\mathbf{m}(\tau),\,\tau,\,\mathbf{c}_t)$$

    • Classifier-Free Guidance (CFG) refines the prediction using weighted speaker, listener, anchor, and combined conditions.
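
A minimal, self-contained sketch of the flow-matching head as summarized above: the $\mathbf{m}_0$-prediction loss and a few-step Euler sampler that converts the prediction into a velocity for the ODE. Module names, dimensions, the $\sigma_\tau = \tau$ schedule, and the velocity conversion are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of flow-matching training and few-step Euler sampling.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_MOTION, D_COND = 64, 256  # hypothetical motion-latent / condition sizes

class Denoiser(nn.Module):
    """Stand-in for the MLP flow head; predicts the clean motion latent m0."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(D_MOTION + D_COND + 1, 512), nn.SiLU(),
            nn.Linear(512, 512), nn.SiLU(),
            nn.Linear(512, D_MOTION),
        )

    def forward(self, m_tau, tau, c):
        return self.net(torch.cat([m_tau, c, tau], dim=-1))

def flow_matching_loss(denoiser, m0, c):
    """L_flow = E_{tau, eps} || D_theta(m_tau, tau, c_t) - m0 ||^2."""
    tau = torch.rand(m0.size(0), 1)
    eps = torch.randn_like(m0)
    m_tau = (1.0 - tau) * m0 + tau * eps       # assumes sigma_tau = tau (linear path)
    return F.mse_loss(denoiser(m_tau, tau, c), m0)

@torch.no_grad()
def sample_motion(denoiser, c, steps=5):
    """Integrate the learned ODE from noise (tau = 1) to data (tau = 0) with Euler steps."""
    m = torch.randn(c.size(0), D_MOTION)
    taus = torch.linspace(1.0, 0.0, steps + 1)
    for i in range(steps):
        tau = taus[i].expand(c.size(0), 1)
        m0_hat = denoiser(m, tau, c)
        v = (m - m0_hat) / taus[i]             # velocity implied by the m0 prediction
        m = m + (taus[i + 1] - taus[i]) * v    # Euler step toward tau = 0
    return m

# Usage sketch:
# loss = flow_matching_loss(Denoiser(), m0=torch.randn(8, D_MOTION), c=torch.randn(8, D_COND))
```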

  • Regularization Strategies

    • Anchor conditioning (random anchor during training, fixed anchor at inference) for pose stabilization.
    • Dropout on the audio (0.5) and anchor (0.1) conditioning inputs for training stability (a sketch of this condition dropout follows below).
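
Condition dropout of this kind is also what typically enables the classifier-free guidance mentioned above. The sketch below shows one plausible wiring with the quoted rates (0.5 for audio, 0.1 for anchor); the null-embedding mechanism, function names, and guidance form are assumptions rather than the paper's exact formulation.

```python
# Illustrative condition dropout for classifier-free guidance (CFG).
import torch

def drop_condition(cond: torch.Tensor, null_cond: torch.Tensor, p: float) -> torch.Tensor:
    """Per-sample replacement of a condition with a (learned) null embedding."""
    keep = (torch.rand(cond.size(0), 1) >= p).float()
    return keep * cond + (1.0 - keep) * null_cond

# Training-time usage (hypothetical tensors):
# audio_c  = drop_condition(audio_c,  null_audio,  p=0.5)   # audio dropout 0.5
# anchor_c = drop_condition(anchor_c, null_anchor, p=0.1)   # anchor dropout 0.1

def cfg_combine(pred_uncond: torch.Tensor, pred_cond: torch.Tensor, w: float = 2.0) -> torch.Tensor:
    """Standard CFG blend: push the conditional prediction away from the unconditional one."""
    return pred_uncond + w * (pred_cond - pred_uncond)
```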

4. Latency and Real-Time Constraints

  • DyStream achieves 29.24 FPS (34 ms per frame) with a 5-step Euler sampler and 38.46 FPS (26 ms per frame) with a 1-step sampler on NVIDIA H200 hardware.
  • The cumulative system latency—including lookahead (60 ms) and frame generation—remains under 100 ms, satisfying practical real-time criteria for conversational agents.
  • Lookahead trade-off (Table 5 of (Chen et al., 30 Dec 2025)):

    Lookahead (ms)    Sync-C score
    0                 2.86
    20                5.98
    40                6.90
    60                7.39
    80                7.67

A 60 ms lookahead offers the best trade-off, keeping latency low while preserving most of the lip-sync gain.
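
As a rough budget check, assuming end-to-end latency is approximated by audio lookahead plus per-frame synthesis time (with other overheads folded into the per-frame figure), the numbers above give

$$60\ \text{ms (lookahead)} + 34\ \text{ms (5-step synthesis)} \approx 94\ \text{ms} < 100\ \text{ms},$$

and roughly $60 + 26 = 86$ ms with the 1-step sampler.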

5. Evaluation and Comparative Performance

  • Speaker Mode (HDTF-100, offline Sync-C):
    • DyStream: 8.136
    • DyStream w/o flow head: 7.867 (−3.3%)
    • DyStream w/o frame-wise addition: 7.660
    • DyStream (online, 60 ms): 7.61
    • Sonic: 8.495
    • SadTalker: 6.704
    • Hallo3: 6.814
    • INFP*: 6.894
    • Real3DPortrait: 6.811
  • Latency Sensitivity (INFP-I, chunk-based, on HDTF):
    • 40 ms: 1.718
    • 80 ms: 3.112
    • 160 ms: 5.335
    • 320 ms: 6.637
  • Listener Mode (RealTalk, Dyadic Reactions):
    Metric (Exp / Pose)      DyStream         INFP
    FD                       0.074 / 3.192    0.141 / 3.158
    MSE                      0.018 / 1.636    0.019 / 1.286
    SID (Shannon index)      4.586 / 3.070    4.263 / 2.562
    Var (variance)           0.275 / 0.596    0.185 / 0.200

DyStream demonstrates significant improvement in listener motion diversity and realism, with ablation studies confirming the contribution of the flow-matching head and anchor conditioning.

6. Model Behavior and Limitations

DyStream produces identity-consistent, visually plausible talking heads with synchronized lip movement and diverse dyadic non-verbal expressions. It supports a broad range of reference poses and enables natural agent turn-taking.

Documented limitations include rigid deformation artifacts on accessories (glasses, jewelry) due to warp-based decoding, the absence of occlusion modeling (e.g., hands over the face), a requirement for pre-aligned and cropped face inputs, and restricted long-term diversity beyond the fixed length of the AR sliding window. No explicit mechanism is incorporated for handling extended sequences or occlusion effects; further research could investigate long-sequence training and explicit occlusion modeling.

7. Context and Prospects

DyStream advances the field of low-latency dyadic talking-head synthesis for interactive agents by integrating probabilistic flow matching, stream-friendly autoregression, and controlled audio lookahead. Its practical design and the open release of model weights and code position it as a benchmark for future work in real-time conversational video generation (Chen et al., 30 Dec 2025). A plausible implication is increased deployment potential for virtual agents in synchronous communication, telepresence, and adaptive UX contexts where instantaneous feedback is critical. Future directions may involve addressing the noted limitations and expanding applications across varied conversational domains.
