SDIF-DA: A Shallow-to-Deep Interaction Framework with Data Augmentation for Multi-modal Intent Detection
Abstract: Multi-modal intent detection uses multiple modalities to understand a user's intent, which is essential for deploying dialogue systems in real-world scenarios. It faces two core challenges: (1) how to effectively align and fuse features from different modalities, and (2) the limited amount of labeled multi-modal intent training data. In this work, we introduce a shallow-to-deep interaction framework with data augmentation (SDIF-DA) to address both challenges. First, SDIF-DA uses a shallow-to-deep interaction module to progressively align and fuse features across the text, video, and audio modalities. Second, we propose a ChatGPT-based data augmentation approach that automatically generates additional training data. Experimental results show that SDIF-DA achieves state-of-the-art performance, indicating that it aligns and fuses multi-modal features effectively. In addition, extensive analyses show that the proposed data augmentation approach successfully distills knowledge from the large language model (LLM).
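To make the data augmentation idea concrete, here is a minimal sketch of LLM-based text augmentation for labeled intent data. It is an illustration under stated assumptions, not the procedure used in SDIF-DA: the prompt wording, the function names `build_prompt` and `augment`, and the stand-in `fake_llm` are all hypothetical, and the real LLM call is abstracted behind a `generate` callable so that no specific API is assumed. Only the text modality is handled; how augmented text is paired with video and audio is not covered here.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (utterance, intent label)


def build_prompt(utterance: str, n: int) -> str:
    """Build a paraphrase request. The wording is a hypothetical example,
    not the prompt used in the paper."""
    return (
        f"Rewrite the following utterance in {n} different ways, "
        "keeping its meaning and intent unchanged. "
        "Return one rewrite per line.\n"
        f"Utterance: {utterance}"
    )


def augment(
    samples: List[Sample],
    generate: Callable[[str], str],
    n: int = 3,
) -> List[Sample]:
    """Return new (utterance, label) pairs produced by the LLM.

    Each rewrite inherits the label of its source utterance. Empty lines,
    copies of the source, and duplicate rewrites (case-insensitive) are dropped.
    """
    augmented: List[Sample] = []
    for text, label in samples:
        seen = {text.strip().lower()}
        reply = generate(build_prompt(text, n))
        for line in reply.splitlines():
            candidate = line.strip()
            key = candidate.lower()
            if candidate and key not in seen:
                seen.add(key)
                augmented.append((candidate, label))
    return augmented


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a real LLM call (assumption for the demo).
    It echoes the utterance once and adds two trivial rewrites."""
    utterance = prompt.rsplit("Utterance: ", 1)[1]
    return (
        f"{utterance}\n"
        f"Could you say: {utterance}\n"
        f"In other words, {utterance}\n"
    )


if __name__ == "__main__":
    data = [("Thanks a lot", "Thank")]
    for new_text, new_label in augment(data, fake_llm):
        print(new_label, "|", new_text)
```

In practice `generate` would wrap whichever chat-completion client is available, and the filtered output would be added to the original training set rather than replacing it.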