ViDA-MAN: Visual Dialog with Digital Humans

Published 26 Oct 2021 in cs.CV (arXiv:2110.13384v1)

Abstract: We demonstrate ViDA-MAN, a digital-human agent for multi-modal interaction, which offers real-time audio-visual responses to instant speech inquiries. Compared to traditional text- or voice-based systems, ViDA-MAN offers human-like interactions (e.g., vivid voice, natural facial expressions, and body gestures). Given a speech request, the demonstration is able to respond with high-quality video in sub-second latency. To deliver an immersive user experience, ViDA-MAN seamlessly integrates multi-modal techniques including Acoustic Speech Recognition (ASR), multi-turn dialog, Text To Speech (TTS), and talking-head video generation. Backed by a large knowledge base, ViDA-MAN is able to chat with users on a number of topics including chit-chat, weather, device control, news recommendations, and booking hotels, as well as answering questions via structured knowledge.


Summary

  • The paper introduces a multimodal system that integrates ASR, TTS, and real-time video generation to create human-like digital interactions.
  • It demonstrates sub-second latency performance, ensuring immediate and immersive responses during natural conversations.
  • The system’s versatility across various topics highlights its potential for applications in virtual assistance and enhanced digital human interaction.

The paper "ViDA-MAN: Visual Dialog with Digital Humans" introduces a system designed for enhancing interactions with digital humans through multi-modal communication. ViDA-MAN focuses on creating more human-like and immersive experiences by providing real-time audio-visual responses to spoken inquiries.

Key Features of ViDA-MAN:

  1. Multimodal Interaction:
    • ViDA-MAN integrates several technologies to allow for seamless interaction, including Acoustic Speech Recognition (ASR), multi-turn dialog systems, Text To Speech (TTS), and talking heads video generation.
    • These components work together to interpret speech and generate corresponding facial expressions and gestures, providing a more engaging user experience than traditional text or voice-based systems.
  2. Real-Time Performance:
    • The system performs with sub-second latency, meaning it can process speech requests and respond with high-quality video almost instantaneously. This rapid response time is crucial for maintaining an immersive user experience.
  3. Human-Like Interaction Capabilities:
    • By generating vivid voices and natural facial expressions, ViDA-MAN can mimic human interactions effectively. This includes not only casual conversations but also expressive body language, which contributes to a more genuine dialog with users.
  4. Wide Range of Topics:
    • Supported by a comprehensive knowledge base, ViDA-MAN is capable of engaging in diverse topics such as chit-chat, weather updates, device control, news recommendations, and hotel bookings. It can also answer structured inquiries, enhancing its utility as an informative agent.
  5. Application Potential:
    • The approach taken by ViDA-MAN suggests it can be used in various contexts where natural human interaction with digital agents is desired, expanding possibilities for virtual assistance across different industries.
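The four components above form a single request-to-response pipeline: speech is transcribed, a reply is generated, the reply is synthesized to audio, and the audio drives the talking-head video. The sketch below illustrates how such a pipeline composes and how end-to-end latency would be measured; the stage functions are hypothetical stubs (the ViDA-MAN implementation is not described at code level in the paper), so only the control flow, not the models, reflects the system.

```python
import time

# Hypothetical stand-ins for the four stages named in the paper
# (ASR, multi-turn dialog, TTS, talking-head video generation).
# Real implementations would call trained models; these stubs
# only demonstrate how the stages chain together.

def asr(audio: bytes) -> str:
    """Transcribe user speech to text."""
    return "what is the weather today"

def dialog(history: list[str], utterance: str) -> str:
    """Produce a reply given the conversation history (multi-turn)."""
    history.append(utterance)
    return "It is sunny with a high of 25 degrees."

def tts(text: str) -> bytes:
    """Synthesize speech audio for the reply (placeholder waveform)."""
    return text.encode("utf-8")

def talking_head(speech: bytes) -> list[bytes]:
    """Render video frames driven by the synthesized speech (placeholder)."""
    return [speech]

def handle_request(audio: bytes, history: list[str]) -> tuple[list[bytes], float]:
    """Run the full pipeline and report end-to-end latency in seconds."""
    start = time.perf_counter()
    text = asr(audio)               # speech -> text
    reply = dialog(history, text)   # text -> reply, updating history
    speech = tts(reply)             # reply -> audio
    frames = talking_head(speech)   # audio -> video frames
    return frames, time.perf_counter() - start
```

In the real system each stage is a heavy model, so achieving the paper's sub-second latency additionally requires streaming and overlapping the stages rather than running them strictly in sequence as this sketch does.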

This work represents a significant advancement in the field of digital human interaction, offering a sophisticated platform for multimodal communication that closely mimics natural human behaviors.
