MIntRec2.0: A Large-scale Benchmark Dataset for Multimodal Intent Recognition and Out-of-scope Detection in Conversations (2403.10943v4)
Abstract: Multimodal intent recognition poses significant challenges, requiring the incorporation of non-verbal modalities from real-world contexts to enhance the comprehension of human intentions. Existing benchmark datasets are limited in scale and struggle to handle the out-of-scope samples that arise in multi-turn conversational interactions. We introduce MIntRec2.0, a large-scale benchmark dataset for multimodal intent recognition in multi-party conversations. It contains 1,245 dialogues with 15,040 samples, each annotated within a new intent taxonomy of 30 fine-grained classes. Besides 9,304 in-scope samples, it also includes 5,736 out-of-scope samples that appear in multi-turn contexts, as naturally occurs in real-world scenarios. Furthermore, we provide comprehensive speaker information for each utterance, enriching its utility for multi-party conversational research. We establish a general framework supporting the organization of single-turn and multi-turn dialogue data, modality feature extraction, multimodal fusion, and both in-scope classification and out-of-scope detection. Evaluation benchmarks are built using classic multimodal fusion methods, ChatGPT, and human evaluators. While existing methods that incorporate non-verbal information yield improvements, effectively leveraging context and detecting out-of-scope samples remain substantial challenges. Notably, LLMs exhibit a significant performance gap relative to humans, highlighting the limitations of machine learning methods on this cognitively demanding intent-understanding task. We believe MIntRec2.0 will serve as a valuable resource, providing a pioneering foundation for research on human-machine conversational interaction and significantly facilitating related applications. The full dataset and code are available at https://github.com/thuiar/MIntRec2.0.
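To make the framework's final stage concrete, below is a minimal sketch of in-scope classification with out-of-scope detection over fused multimodal features. This is not the paper's benchmark implementation: the feature dimensions (`TEXT_DIM`, `VIDEO_DIM`, `AUDIO_DIM`), the concatenation-plus-MLP fusion, and the confidence threshold are all illustrative assumptions. Maximum-softmax-probability thresholding is shown here as one standard out-of-scope baseline, not necessarily the detector used in the paper.

```python
# A minimal sketch (not the authors' reference implementation) of the
# pipeline the abstract describes: fuse per-utterance text/video/audio
# features, classify among 30 in-scope intents, and flag low-confidence
# predictions as out-of-scope. All dimensions below are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_INTENTS = 30  # MIntRec2.0 defines 30 fine-grained intent classes
TEXT_DIM, VIDEO_DIM, AUDIO_DIM = 768, 1024, 512  # assumed extractor sizes


class ConcatFusionClassifier(nn.Module):
    """Early-fusion baseline: concatenate modality features, then an MLP."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(TEXT_DIM + VIDEO_DIM + AUDIO_DIM, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, NUM_INTENTS),
        )

    def forward(self, text, video, audio):
        # Fuse by concatenation along the feature axis, then classify.
        return self.mlp(torch.cat([text, video, audio], dim=-1))


@torch.no_grad()
def predict_with_oos(model, text, video, audio, threshold: float = 0.5):
    """Maximum-softmax-probability OOS detection: if the top-class
    probability falls below `threshold`, mark the sample out-of-scope (-1)."""
    probs = F.softmax(model(text, video, audio), dim=-1)
    conf, pred = probs.max(dim=-1)
    pred[conf < threshold] = -1  # -1 denotes out-of-scope
    return pred, conf


if __name__ == "__main__":
    model = ConcatFusionClassifier().eval()
    batch = 4
    # Random tensors stand in for real per-utterance modality features.
    text = torch.randn(batch, TEXT_DIM)
    video = torch.randn(batch, VIDEO_DIM)
    audio = torch.randn(batch, AUDIO_DIM)
    preds, confs = predict_with_oos(model, text, video, audio)
    print(preds, confs)
```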
Authors: Hanlei Zhang, Xin Wang, Hua Xu, Qianrui Zhou, Kai Gao, Jianhua Su, Jinyue Zhao, Wenrui Li, Yanting Chen