Real-time Audiovisual Dialogue

Updated 27 May 2026
  • Real-time audiovisual dialogue systems are multimodal interaction platforms that integrate live audio, video, and behavioral signals to deliver synchronized and natural conversations.
  • They employ advanced feature extraction and fusion techniques, using tools like I3D, Wav2Vec 2.0, and transformer models to align audio and visual data in real time.
  • These systems address challenges such as fine-grained temporal synchronization and multi-speaker interference, enabling applications from VR communication to telepresence avatars.

Real-time audiovisual dialogue refers to the class of dialogue systems that integrate audio, visual, and often multimodal behavioral cues to enable agents—human or artificial—to engage in natural, temporally synchronized conversation, typically with very low (sub-second) algorithmic latency. These systems must ingest, align, and fuse live speech, facial or full-body signals, and semantic context to produce contextually and visually coherent responses as conversation unfolds. This interactional setting is central to applications ranging from embodied AI agents and telepresence avatars to immersive VR/AR communication and next-generation human–machine interfaces.

1. Task Settings and Datasets

The prototypical real-time audiovisual dialogue system accepts temporally aligned audio (waveform or tokenized), visual streams (RGB, pose, or 3D mesh), and dialogue history as input, and generates token-level replies (text, speech, or multimodal actions) under stringent latency constraints. Early work operationalizes this in the Audio-Visual Scene-Aware Dialog (AVSD) task: at each round $t$, the system receives a video $V$ (≈30 s clip), audio $A$, a flattened dialog history $DH_t = (S, Q_0, A_0, \ldots, Q_{t-1}, A_{t-1})$, and a current question $Q_t$, and must output a single language answer $A_t$ grounded in all available modalities (Alamri et al., 2019).
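The input at round $t$ can be pictured as a small record plus a flattening step. The sketch below is only an illustration of that structure: the class name `AVSDRound`, its field names, and the reading of $S$ as the per-video summary are assumptions made here, not taken from the AVSD release.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AVSDRound:
    """Inputs available at dialogue round t (hypothetical container)."""
    video_path: str   # V: the ~30 s clip (field name is an assumption)
    audio_path: str   # A: the clip's audio track (field name is an assumption)
    summary: str      # S: assumed here to be the per-video summary
    history: List[Tuple[str, str]] = field(default_factory=list)  # (Q_i, A_i), i < t
    question: str = ""  # Q_t: the current question

    def flattened_history(self) -> List[str]:
        """Return DH_t = (S, Q_0, A_0, ..., Q_{t-1}, A_{t-1}) as a flat list."""
        flat = [self.summary]
        for q, a in self.history:
            flat.extend([q, a])
        return flat


# Usage: two completed turns, so the flattened history has 1 + 2*2 entries.
round_t = AVSDRound(
    video_path="clip.mp4",
    audio_path="clip.wav",
    summary="S",
    history=[("Q0", "A0"), ("Q1", "A1")],
    question="Q2",
)
flat = round_t.flattened_history()
```

A model would consume `flat` together with the current question and the audio and video features; how those are encoded is outside this sketch.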

Key AVSD dataset statistics:

  • 11,816 Charades video clips;
  • One 10-turn Q&A dialog + summary per video, totaling 118,160 QA pairs;
  • Average question length ≈7.9 words, answer ≈9.4 words;
  • Over 70% of dialogs are temporally grounded, ≈57% reference audio.

Modern datasets extend this to 3D dyadic conversational pairs (Shan et al., 9 Mar 2026), emotionally annotated multiparty video chat (Park et al., 2024), and user-driven multimodal themes (Wang et al., 31 Jan 2025). Benchmarks include MultiDialog, RAVDESS, CREMA-D, BEAT v2, and HumanML3D.

2. Multimodal Feature Extraction and Fusion

Robust real-time fusion mandates extracting temporally aligned embeddings:

Fusion mechanisms span late concatenation and fully connected (FC) layers (baseline AVSD) (Alamri et al., 2019), multimodal attention (modality-specific self- and cross-attention in matrices), and learnable gates or transformer-based fusion (Chen et al., 14 Nov 2025, Park et al., 2024). In dual-speaker animation tasks, dual-stream architectures with inter-speaker cross-attention explicitly account for dynamic roles and interaction (Shan et al., 9 Mar 2026).
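To make the idea of a learnable gate concrete, here is a minimal sketch of one possible gating scheme: a scalar gate computed from both embeddings decides how much of the audio versus the visual vector enters the fused representation. The function names, the scalar (rather than per-dimension) gate, and the parameter layout are assumptions for illustration; the cited systems may gate quite differently.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_fusion(audio: List[float], visual: List[float],
                 w_audio: List[float], w_visual: List[float],
                 bias: float = 0.0) -> List[float]:
    """Blend two same-length embeddings with a scalar gate g in (0, 1).

    g     = sigmoid(w_audio . audio + w_visual . visual + bias)
    fused = g * audio + (1 - g) * visual
    The weights and bias stand in for parameters that would be learned.
    """
    if not (len(audio) == len(visual) == len(w_audio) == len(w_visual)):
        raise ValueError("all vectors must have the same length")
    logit = bias
    logit += sum(w * a for w, a in zip(w_audio, audio))
    logit += sum(w * v for w, v in zip(w_visual, visual))
    g = sigmoid(logit)
    return [g * a + (1.0 - g) * v for a, v in zip(audio, visual)]


# With zero weights and zero bias the gate is 0.5, so the result is the average.
fused = gated_fusion([1.0, 3.0], [3.0, 1.0], [0.0, 0.0], [0.0, 0.0])
```

In a trained system the gate would typically be computed per time step, so the model can lean on lip motion when the audio is noisy and on audio when the face is occluded.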

3. Dialogue Generation, Speaker Modeling, and Turn Management

Dialogue generation frameworks employ multiple strategies:

  • Baseline: Candidate ranking via inner product between multimodal scene-aware embedding $\mathbf{e}$ and candidate answer embedding $\mathbf{a}_{t,i}$, normalized via softmax (Alamri et al., 2019).
  • Streaming: Causal transformer decoders ingest fused audio-visual and prior token signals, offering interleaved prediction for ASR tokens (Uₙ) and turn-events (Tₙ) (Chen et al., 14 Nov 2025).
  • Instruction/fine-grained: Chain-of-thought (CoT) scaffolding with explicit internal plans, then sequential decoding of text, speech, and gesture tokens in lock-step for contextual alignment (Deng et al., 27 Feb 2026).
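The baseline ranking strategy in the list above reduces to a dot product followed by a softmax. The following sketch shows that computation on plain Python lists; the function names and the toy vectors are illustrative assumptions, and a real system would obtain the embeddings from trained encoders.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Softmax with max-subtraction for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_candidates(e: Sequence[float],
                    candidates: Sequence[Sequence[float]]
                    ) -> Tuple[List[float], List[int]]:
    """Score each candidate answer embedding a_{t,i} by its inner product with e.

    Returns the softmax-normalized scores and the candidate indices
    sorted from most to least probable.
    """
    scores = [sum(x * y for x, y in zip(e, a)) for a in candidates]
    probs = softmax(scores)
    order = sorted(range(len(candidates)), key=lambda i: probs[i], reverse=True)
    return probs, order


# Toy example: raw scores are 0, 2, 1, so candidate 1 ranks first.
probs, order = rank_candidates([1.0, 0.0], [[0.0, 1.0], [2.0, 0.0], [1.0, 0.0]])
```

The position of the ground-truth answer in `order` is what retrieval metrics such as Recall@k and MRR are computed from.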

Speaker tracking is realized by synchronizing detected face tracks, lip crops, or pose with audio codes (preserving timbre). The transformer backbone can then focus on speaker-coherent segments, especially under noisy, multi-talker conditions (Chen et al., 14 Nov 2025). Turn-taking triggers are modeled as independent token streams (<SOT>, <SOB>, <EMP>) and optimized via weighted cross-entropy, enabling explicit interruptibility and natural floor management.
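A weighted cross-entropy over the turn-event stream might look like the sketch below. Only the three token names come from the text; the function name, the class weights, and the choice to normalize by the total weight (one common convention) are assumptions made for illustration.

```python
import math
from typing import Dict, List

TURN_TOKENS = ["<SOT>", "<SOB>", "<EMP>"]


def weighted_cross_entropy(probs: List[Dict[str, float]],
                           targets: List[str],
                           weights: Dict[str, float]) -> float:
    """Weighted mean of -log p(target) over frames.

    probs[i] maps each turn token to its predicted probability at frame i.
    weights maps each token to a class weight (hypothetical values).
    The sum is normalized by the total weight of the targets.
    """
    num, den = 0.0, 0.0
    for p, y in zip(probs, targets):
        w = weights[y]
        num += -w * math.log(max(p[y], 1e-12))  # clamp to avoid log(0)
        den += w
    return num / den if den > 0 else 0.0


# Two frames; the rarer event tokens are up-weighted relative to <EMP>.
frame_probs = [
    {"<SOT>": 0.50, "<SOB>": 0.25, "<EMP>": 0.25},
    {"<SOT>": 0.25, "<SOB>": 0.25, "<EMP>": 0.50},
]
loss = weighted_cross_entropy(
    frame_probs, ["<SOT>", "<SOB>"],
    weights={"<SOT>": 3.0, "<SOB>": 1.0, "<EMP>": 0.1},
)
```

Up-weighting the event tokens is one plausible way to keep them from being swamped when most frames carry no turn event; the weights actually used are not stated in the text.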

Theme- and emotion-aware systems augment the above with explicit role cards, dialogue memory (for each agent), and emotion tagging at each response, leveraging LLM planners to ensure both thematic and behavioral consistency aligned to visual context (Park et al., 2024, Wang et al., 31 Jan 2025).

4. Real-Time Processing and System Optimization

To ensure low-latency operation, real-time audiovisual dialogue pipelines adopt:

Empirical latencies for state-of-the-art systems:

5. Evaluation Protocols, Metrics, and Comparative Results

Evaluation is multi-granular, reflecting retrieval, generation, and subjective aspects:

| Metric | Usage/Task | Description / Units |
|---|---|---|
| Recall@k, MRR, Mean Rank | Retrieval (AVSD, ranking) | Fraction of correct response in top-k; mean inverse rank (Alamri et al., 2019) |
| BLEU, METEOR, ROUGE-L | Generation (text/summary) | N-gram overlap, F-measure, recall, precision |
| FD, P-FD, vMSE, SID | 3D animation (dyadic or single) | Distributional, parameter, and diversity metrics |
| WER | Streaming transcription | Word Error Rate in % |
| FTO, N-MOS, H-MOS | Turn-taking, naturalness, helpfulness | Floor-Transfer-Offset (timing), Mean Opinion Score |
| Production Quality, Lip Sync | AV-generation, dialogue video | Human and automated assessment of rendering and AV alignment (Pang et al., 2 Dec 2025) |
| EmoBERTScore, GPT-4 fluency | Emotional and context appropriateness | Classifier and model-based subjective scoring (Park et al., 2024) |
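Two rows of the table have simple closed-form definitions that are easy to get subtly wrong, so a small sketch may help. It computes Recall@k, MRR, and mean rank from 1-based ground-truth ranks, and WER from a word-level edit distance. The function names are illustrative, and whitespace tokenization is an assumption; published WER numbers usually apply text normalization first.

```python
from typing import Dict, Sequence


def retrieval_metrics(ranks: Sequence[int], k: int = 5) -> Dict[str, float]:
    """ranks[i] is the 1-based rank of the correct answer for question i."""
    n = len(ranks)
    if n == 0:
        raise ValueError("ranks must be non-empty")
    return {
        "recall@k": sum(1 for r in ranks if r <= k) / n,
        "mrr": sum(1.0 / r for r in ranks) / n,
        "mean_rank": sum(ranks) / n,
    }


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return 100.0 * prev[-1] / len(ref)


metrics = retrieval_metrics([1, 2, 10], k=5)
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Note that WER can exceed 100% when the hypothesis contains many insertions, since the denominator is the reference length only.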

Exemplary comparative results:

  • AV-EmoDialog achieves BLEU-4 = 0.0307, EmoBERT = 0.30, Dist-1 = 0.90 (outperforming LLaVA-NeXT-vid and Qwen-Audio+LLM) (Park et al., 2024).
  • U-Mind reports Fréchet Gesture Distance (FGD) = 7.67 and LLM-judged naturalness = 8.11/10 (in multimodal dialogue) (Deng et al., 27 Feb 2026).
  • "Talking Together" 3D dyadic avatars: FD = 10.43 (vs. baseline 28.4), vMSE_speaker = 7.99, vMSE_listener = 2.29, human preference 70–80% for all criteria (Shan et al., 9 Mar 2026).
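The Dist-1 figure quoted above is a lexical diversity score. The text does not define it, so the sketch below assumes the usual distinct-n definition: the number of unique n-grams divided by the total number of n-grams across all generated responses. Lowercasing and whitespace tokenization are further assumptions; the cited paper may tokenize differently.

```python
from typing import Iterable, Set, Tuple


def distinct_n(responses: Iterable[str], n: int = 1) -> float:
    """Ratio of unique n-grams to total n-grams over a set of responses."""
    total = 0
    unique: Set[Tuple[str, ...]] = set()
    for text in responses:
        tokens = text.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


# Five unigrams in total, three of them distinct (a, b, c).
dist1 = distinct_n(["a b a", "b c"], n=1)
```

Values near 1 indicate little repetition across responses; values near 0 indicate that the model keeps reusing the same words.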

6. System Designs: Architectures, Training, and Coordination

A spectrum of architectural approaches is found:

  • Modular late-fusion with LSTM/Transformer for separate modalities, plus discriminative ranking (Alamri et al., 2019);
  • Dual-stream U-Nets with cross-attention to model co-located dyadic behavior, augmented with speaker-role embedding and gaze minimization (Shan et al., 9 Mar 2026);
  • Conductor–Creator decomposition (explicit understanding vs. generation) for autoregressive (AR) speech and diffusion video rendering, with cross-clip/multimodal attention for coherence (Pang et al., 2 Dec 2025);
  • End-to-end UARF alignment: tokenizing text, speech, and motion into a unified embedding flow, with chain-of-thought planning and prosodic segment randomization for cross-modal grounding (Deng et al., 27 Feb 2026);
  • Multi-agent LLM–VLM–ASR co-agent dialogue with theme and visual consistency self-correction, enabling both role separation and real-time self-corrective feedback (Wang et al., 31 Jan 2025).

Training is typically curriculum-based:

7. Challenges, Developments, and Directions

Key technical challenges for real-time audiovisual dialogue include:

  • Fine-grained temporal alignment across modalities (requiring, e.g., segment-wise shuffling or attention-based fusion);
  • Robustness under multi-speaker interference, noise, and diverse behavior (addressed via acoustic tokenization and visual tracking);
  • Cross-modal synchrony for long-duration content, preventing identity or timbre collapse (necessitating recurrent fusion and autoregressive, not segmented, decoding) (Pang et al., 2 Dec 2025);
  • Incorporating emotional nuance, theme control, and context adaptation in the loop (via explicit λ-weighted losses and multi-agent self-correction).

Recent work demonstrates that, with optimized fusion and decoding, real-time systems can now both achieve high-quality, contextually grounded interaction and keep algorithmic latency below 400 ms per conversational turn, even with complex generation backends and emotion–theme conditioning (Park et al., 2024, Chen et al., 14 Nov 2025, Pang et al., 2 Dec 2025).

A plausible implication is that advances in segment alignment, unified embedding spaces, and explicit planning (chain-of-thought) have been decisive in closing the gap between open-domain conversational fluency and the perceptually grounded, low-latency multimodal generation that immersive and embodied real-time dialogue agents require.