HEAR: Hearing Enhanced Audio Response for Video-grounded Dialogue (2312.09736v1)
Abstract: Video-grounded Dialogue (VGD) aims to answer questions about a given multi-modal input comprising video, audio, and dialogue history. Although there have been numerous efforts to improve the response quality of VGD systems, existing systems incorporate only the video and text information and tend to struggle to extract the necessary information from the audio when generating appropriate responses to the question. The VGD system thus appears to be deaf, and we coin this symptom of current systems ignoring audio data as a deaf response. To overcome the deaf response problem, the Hearing Enhanced Audio Response (HEAR) framework is proposed to perform sensible listening by selectively attending to audio whenever the question requires it. HEAR enhances the accuracy and audibility of VGD systems in a model-agnostic manner. It is validated on VGD datasets (i.e., AVSD@DSTC7 and AVSD@DSTC8) and shows effectiveness with various VGD systems.
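The abstract's core idea of "selectively attending to audio whenever the question requires it" can be illustrated with a minimal sketch. This is not the paper's actual HEAR architecture; the module name `AudioGate` and the gating design below are hypothetical, showing one plausible way a question embedding could gate audio features before they reach a dialogue model:

```python
import torch
import torch.nn as nn

class AudioGate(nn.Module):
    """Hypothetical sketch of question-conditioned audio gating.

    A scorer estimates how relevant each audio frame is to the question
    and scales the audio features accordingly, so the downstream dialogue
    model "listens" only when the question calls for audio information.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 1),
        )

    def forward(self, question: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # question: (batch, dim) pooled question embedding
        # audio:    (batch, time, dim) audio frame features (e.g., from wav2vec 2.0)
        q = question.unsqueeze(1).expand(-1, audio.size(1), -1)
        gate = torch.sigmoid(self.scorer(torch.cat([q, audio], dim=-1)))  # (batch, time, 1)
        return gate * audio  # audio is attenuated when judged irrelevant to the question
```

Because the gate operates purely on feature tensors, such a module could in principle be attached in front of different VGD backbones, which is consistent with the abstract's claim of a model-agnostic enhancement.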