Leveraging Topics and Audio Features with Multimodal Attention for Audio Visual Scene-Aware Dialog (1912.10131v1)

Published 20 Dec 2019 in cs.MM, cs.CL, cs.SD, and eess.AS

Abstract: With the recent advancements in AI, Intelligent Virtual Assistants (IVAs) such as Alexa, Google Home, etc., have become a ubiquitous part of many homes. Currently, such IVAs are mostly audio-based, but going forward, we are witnessing a confluence of vision, speech and dialog system technologies that are enabling the IVAs to learn audio-visual groundings of utterances. This will enable agents to have conversations with users about the objects, activities and events surrounding them. In this work, we present three main architectural explorations for the Audio Visual Scene-Aware Dialog (AVSD) task: 1) investigating `topics' of the dialog as an important contextual feature for the conversation, 2) exploring several multimodal attention mechanisms during response generation, 3) incorporating an end-to-end audio classification ConvNet, AclNet, into our architecture. We present a detailed analysis of the experimental results and show that our model variations outperform the baseline system presented for the AVSD task.
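The second exploration, attending over multiple modalities during response generation, can be illustrated with a minimal sketch. This is an assumption-laden toy version, not the paper's implementation: it scores each modality's feature vector (e.g. video, audio, dialog history) against a decoder query, normalizes the scores with a softmax, and returns the weighted fusion. All names (`multimodal_attention`, the modality keys) are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def multimodal_attention(query, modality_feats):
    """Fuse per-modality feature vectors via dot-product attention.

    query          : (d,) decoder state used as the attention query
    modality_feats : dict mapping modality name -> (d,) feature vector
    Returns the fused (d,) context vector and the attention weights.
    """
    names = list(modality_feats)
    K = np.stack([modality_feats[n] for n in names])  # (m, d)
    scores = K @ query                                # one score per modality
    weights = softmax(scores)                         # weights sum to 1
    fused = weights @ K                               # convex combination of features
    return fused, dict(zip(names, weights))
```

In practice each modality's vector would itself come from an encoder (e.g. an audio ConvNet like AclNet for the audio stream), and the attention would typically be recomputed at every decoding step; this sketch shows only the fusion step.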

Authors (5)
  1. Eda Okur (20 papers)
  2. Saurav Sahay (34 papers)
  3. Jonathan Huang (46 papers)
  4. Lama Nachman (27 papers)
  5. Shachi H Kumar (17 papers)
Citations (7)