SDIF-DA: A Shallow-to-Deep Interaction Framework with Data Augmentation for Multi-modal Intent Detection

Published 31 Dec 2023 in cs.CL | (2401.00424v1)

Abstract: Multi-modal intent detection aims to use multiple modalities to understand a user's intention, which is essential for deploying dialogue systems in real-world scenarios. It faces two core challenges: (1) how to effectively align and fuse features from different modalities, and (2) the limited amount of labeled multi-modal intent training data. In this work, we introduce a shallow-to-deep interaction framework with data augmentation (SDIF-DA) to address these challenges. First, SDIF-DA uses a shallow-to-deep interaction module to progressively align and fuse features across the text, video, and audio modalities. Second, we propose a ChatGPT-based data augmentation approach that automatically generates additional training data. Experimental results show that SDIF-DA achieves state-of-the-art performance, indicating that it aligns and fuses multi-modal features effectively. In addition, extensive analyses show that the proposed data augmentation approach can successfully distill knowledge from the LLM.
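
To make the data augmentation idea concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not give the prompt, the number of rewrites, or how augmented text is paired with the other modalities, so everything here is an assumption: the `Sample` fields, the prompt wording in `build_prompt`, the choice to paraphrase only the text while reusing the original video and audio, and the `fake_llm` stub that stands in for a real ChatGPT call.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Sample:
    """One labeled multi-modal example (field names are hypothetical)."""
    text: str
    video_path: str
    audio_path: str
    intent: str


def build_prompt(utterance: str, n: int) -> str:
    """Assumed prompt wording; the paper's actual prompt is not given in the abstract."""
    return (
        f"Rewrite the following utterance in {n} different ways, "
        "keeping the speaker's intent unchanged. "
        "Return one rewrite per line.\n"
        f"Utterance: {utterance}"
    )


def augment(
    samples: List[Sample],
    paraphrase: Callable[[str], List[str]],
    n: int = 2,
) -> List[Sample]:
    """Keep every original sample and add up to n text paraphrases of it.

    Assumption: each paraphrase reuses the original video, audio, and label.
    """
    augmented: List[Sample] = []
    for sample in samples:
        augmented.append(sample)
        seen = {sample.text}
        for line in paraphrase(build_prompt(sample.text, n))[:n]:
            candidate = line.strip()
            # Skip empty lines and rewrites identical to text already kept.
            if candidate and candidate not in seen:
                seen.add(candidate)
                augmented.append(replace(sample, text=candidate))
    return augmented


def fake_llm(prompt: str) -> List[str]:
    """Offline stand-in for an LLM call, used only to exercise the sketch."""
    utterance = prompt.rsplit("Utterance: ", 1)[1]
    return [f"{utterance} (variant 1)", utterance, f"{utterance} (variant 2)"]
```

In practice `paraphrase` would wrap a call to a chat model and split the reply into lines; passing it in as a function keeps the sketch runnable offline and makes the pairing logic easy to check on its own.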
